Computer Vision and Pattern Recognition 90
☆ Odd-One-Out: Anomaly Detection by Comparing with Neighbors
This paper introduces a novel anomaly detection (AD) problem that focuses on
identifying `odd-looking' objects relative to the other instances within a
scene. Unlike in traditional AD benchmarks, anomalies in our setting are
scene-specific, defined by the regular instances that make up the
majority. Since object instances are often partly visible from a single
viewpoint, our setting provides multiple views of each scene as input. To
provide a testbed for future research in this task, we introduce two
benchmarks, ToysAD-8K and PartsAD-15K. We propose a novel method that generates
3D object-centric representations for each instance and detects the anomalous
ones through a cross-examination between the instances. We rigorously analyze
our method quantitatively and qualitatively in the presented benchmarks.
comment: Codes & Dataset at https://github.com/VICO-UoE/OddOneOutAD
☆ Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen
Multimodal large language models (MLLMs) have shown impressive success across
modalities such as image, video, and audio in a variety of understanding and
generation tasks. However, current MLLMs are surprisingly poor at understanding
webpage screenshots and generating their corresponding HTML code. To address
this problem, we propose Web2Code, a benchmark consisting of a new large-scale
webpage-to-code dataset for instruction tuning and an evaluation framework for
the webpage understanding and HTML code translation abilities of MLLMs. For
dataset construction, we leverage pretrained LLMs to enhance existing
webpage-to-code datasets as well as generate a diverse pool of new webpages
rendered into images. Specifically, the inputs are webpage images and
instructions, while the responses are the webpage's HTML code. We further
include diverse natural language QA pairs about the webpage content in the
responses to enable a more comprehensive understanding of the web content. To
evaluate model performance in these tasks, we develop an evaluation framework
for testing MLLMs' abilities in webpage understanding and web-to-code
generation. Extensive experiments show that our proposed dataset is beneficial
not only to our proposed tasks but also in the general visual domain, while
previous datasets result in worse performance. We hope our work will contribute
to the development of general MLLMs suitable for web-based content generation
and task automation. Our data and code will be available at
https://github.com/MBZUAI-LLM/web2code.
comment: Website at https://mbzuai-llm.github.io/webpage2code/
☆ LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo
Large Language Models (LLMs) equipped with extensive world knowledge and
strong reasoning skills can tackle diverse tasks across domains, often by
posing them as conversation-style instruction-response pairs. In this paper, we
propose LLaRA: Large Language and Robotics Assistant, a framework which
formulates robot action policy as conversations, and provides improved
responses when trained with auxiliary data that complements policy learning.
LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity
to process state information as visual-textual prompts and generate optimal
policy decisions in text. To train such action policy VLMs, we first introduce
an automated pipeline to generate diverse high-quality robotics instruction
data from existing behavior cloning data. A VLM finetuned with the resulting
collection of datasets based on a conversation-style formulation tailored for
robotics tasks can generate meaningful robot action policy decisions. Our
experiments across multiple simulated and real-world environments demonstrate
the state-of-the-art performance of the proposed LLaRA framework. The code,
datasets, and pretrained models are available at
https://github.com/LostXine/LLaRA.
☆ LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression
While significant advancements have been made in compressed representations
for text embeddings in large language models (LLMs), the compression of visual
tokens in large multi-modal models (LMMs) has remained a largely overlooked
area. In this work, we present a study of the redundancy of visual tokens
and of efficient training within these models. Our
initial experiments show that eliminating up to 70% of visual tokens at the
testing stage by simply average pooling only leads to a minimal 3% reduction in
visual question answering accuracy on the GQA benchmark, indicating significant
redundancy in visual context. Addressing this, we introduce Visual Context
Compressor, which reduces the number of visual tokens during training to
enhance training efficiency without sacrificing performance. To minimize
information loss caused by the compression on visual tokens while maintaining
training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta
incorporates stage-wise visual context compression, compressing the visual
tokens heavily at first, then more lightly, and finally not at all at the end
of training, so that no information is lost at test time. Extensive
experiments demonstrate that our approach enhances the performance of MLLMs in
both image-language and video-language understanding, while also significantly
cutting training costs. Code is available at
https://github.com/Beckschen/LLaVolta
comment: Code is available at https://github.com/Beckschen/LLaVolta
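The average-pooling idea in the LLaVolta abstract can be illustrated with a minimal sketch. It assumes visual tokens are plain lists of floats pooled in non-overlapping groups; the function name `average_pool_tokens` and the stride schedule `STAGE_STRIDES` are hypothetical and not taken from the paper or its code.

```python
def average_pool_tokens(tokens, stride):
    """Average-pool token vectors in non-overlapping groups of `stride`."""
    pooled = []
    for start in range(0, len(tokens), stride):
        group = tokens[start:start + stride]
        dim = len(group[0])
        # Element-wise mean over the vectors in this group.
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


# Hypothetical stage schedule (an assumption): heavy compression first,
# lighter compression later, and no compression (stride 1) in the last stage.
STAGE_STRIDES = [8, 4, 2, 1]

tokens = [[float(i), float(2 * i)] for i in range(16)]
for stride in STAGE_STRIDES:
    print(stride, len(average_pool_tokens(tokens, stride)))
```

With a stride of 1 the tokens pass through unchanged, which mirrors the abstract's point that the final stage applies no compression.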
☆ Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
Diffusion-based models have shown great potential in generating high-quality
images with various layouts, which can benefit downstream perception tasks.
However, fully automatic layout generation driven only by language and a
suitable metric for measuring multiple generated instances have not been well
explored. In this work, we present Auto Cherry-Picker (ACP), a novel framework
that generates high-quality multi-modal training examples to augment perception
and multi-modal training. Starting with a simple list of natural language
concepts, we prompt large language models (LLMs) to generate a detailed
description and design reasonable layouts. Next, we use an off-the-shelf
text-to-image model to generate multiple images. Then, the generated data are
refined using a comprehensively designed metric to ensure quality. In
particular, we present a new metric, Composite Layout and Image Score (CLIS),
to evaluate the generated images fairly. Our synthetic high-quality examples
boost performance in various scenarios by customizing the initial concept list,
especially in addressing challenges associated with long-tailed distribution
and imbalanced datasets. Experiment results on downstream tasks demonstrate
that Auto Cherry-Picker can significantly improve the performance of existing
models. In addition, we have thoroughly investigated the correlation between
CLIS and performance gains in downstream tasks, and we find that a better CLIS
score results in better performance. This finding shows the potential of
evaluation metrics to play a role in various visual perception and MLLM tasks.
Code will be available.
comment: 19 pages, 7 figures
☆ PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators
Kuo-Hao Zeng, Zichen Zhang, Kiana Ehsani, Rose Hendrix, Jordi Salvador, Alvaro Herrasti, Ross Girshick, Aniruddha Kembhavi, Luca Weihs
We present PoliFormer (Policy Transformer), an RGB-only indoor navigation
agent trained end-to-end with reinforcement learning at scale that generalizes
to the real world without adaptation despite being trained purely in
simulation. PoliFormer uses a foundational vision transformer encoder with a
causal transformer decoder enabling long-term memory and reasoning. It is
trained for hundreds of millions of interactions across diverse environments,
leveraging parallelized, multi-machine rollouts for efficient training with
high throughput. PoliFormer is a masterful navigator, producing
state-of-the-art results across two distinct embodiments, the LoCoBot and
Stretch RE-1 robots, and four navigation benchmarks. It breaks through the
plateaus of previous work, achieving an unprecedented 85.5% success rate in
object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement.
PoliFormer can also be trivially extended to a variety of downstream
applications such as object tracking, multi-object navigation, and
open-vocabulary navigation with no finetuning.
☆ Segment Anything without Supervision
The Segment Anything Model (SAM) requires labor-intensive data labeling.
We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image
segmentation that does not require human annotations. UnSAM utilizes a
divide-and-conquer strategy to "discover" the hierarchical structure of visual
scenes. We first leverage top-down clustering methods to partition an unlabeled
image into instance/semantic level segments. For all pixels within a segment, a
bottom-up clustering method is employed to iteratively merge them into larger
groups, thereby forming a hierarchical structure. These unsupervised
multi-granular masks are then utilized to supervise model training. Evaluated
across seven popular datasets, UnSAM achieves competitive results with the
supervised counterpart SAM, and surpasses the previous state-of-the-art in
unsupervised segmentation by 11% in terms of AR. Moreover, we show that
supervised SAM can also benefit from our self-supervised labels. By integrating
our unsupervised pseudo masks into SA-1B's ground-truth masks and training
UnSAM with only 1% of SA-1B, a lightly semi-supervised UnSAM can often segment
entities overlooked by supervised SAM, exceeding SAM's AR by over 6.7% and AP
by 3.9% on SA-1B.
comment: Code: https://github.com/frank-xwang/UnSAM
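The bottom-up merging step in the UnSAM abstract can be sketched as greedy agglomerative clustering. The abstract does not name the clustering method, so everything here is an assumption: each region is reduced to a single scalar feature, the two clusters with the closest means are merged repeatedly, and several thresholds give several granularities. The function name `merge_bottom_up` is hypothetical.

```python
def merge_bottom_up(features, threshold):
    """Greedily merge the two clusters with the closest mean feature
    until the smallest gap exceeds `threshold`."""
    clusters = [[i] for i in range(len(features))]

    def mean(cluster):
        return sum(features[i] for i in cluster) / len(cluster)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gap = abs(mean(clusters[a]) - mean(clusters[b]))
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy per-region features (illustrative values only).
features = [0.10, 0.12, 0.50, 0.55, 0.90]
# Larger thresholds yield coarser levels of the hierarchy.
hierarchy = [merge_bottom_up(features, t) for t in (0.03, 0.1, 1.0)]
for level in hierarchy:
    print(level)
```

Each level is a coarser grouping of the one before it, which is the sense in which the merged groups form a hierarchy.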
☆ GM-DF: Generalized Multi-Scenario Deepfake Detection
Existing face forgery detection usually follows the paradigm of training
models in a single domain, which leads to limited generalization capacity when
unseen scenarios and unknown attacks occur. In this paper, we thoroughly
investigate the generalization capacity of deepfake detection models when
jointly trained on multiple face forgery detection datasets. We first find a
rapid degradation of detection accuracy when models are directly trained on
combined datasets due to the discrepancy across collection scenarios and
generation methods. To address the above issue, a Generalized Multi-Scenario
Deepfake Detection framework (GM-DF) is proposed to serve multiple real-world
scenarios by a unified model. First, we propose a hybrid expert modeling
approach for domain-specific real/forgery feature extraction. In addition, for
the commonality representation, we use CLIP to extract the common features for
better aligning visual and textual features across domains. Meanwhile, we
introduce a masked image reconstruction mechanism to force models to capture
rich forged details. Finally, we supervise the models via a domain-aware
meta-learning strategy to further enhance their generalization capacities.
Specifically, we design a novel domain alignment loss to strongly align the
distributions of the meta-test domains and meta-train domains. Thus, the
updated models are able to represent both specific and common real/forgery
features across multiple datasets. In consideration of the lack of study of
multi-dataset training, we establish a new benchmark leveraging multi-source
data to fairly evaluate the models' generalization capacity on unseen
scenarios. Both qualitative and quantitative experiments on five datasets
conducted on traditional protocols as well as the proposed benchmark
demonstrate the effectiveness of our approach.
☆ HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model
We introduce HouseCrafter, a novel approach that can lift a floorplan into a
complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a
2D diffusion model, which is trained on web-scale images, to generate
consistent multi-view color (RGB) and depth (D) images across different
locations of the scene. Specifically, the RGB-D images are generated
autoregressively in a batch-wise manner along sampled locations based on the
floorplan, where previously generated images are used as conditions for the
diffusion model to produce images at nearby locations. The global floorplan and
attention design in the diffusion model ensure the consistency of the
generated images, from which a 3D scene can be reconstructed. Through extensive
evaluation on the 3D-Front dataset, we demonstrate that HouseCrafter can generate
high-quality house-scale 3D scenes. Ablation studies also validate the
effectiveness of different design choices. We will release our code and model
weights. Project page: https://neu-vi.github.io/houseCrafter/
☆ EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Yuxuan Zhang, Tianheng Cheng, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
Segment Anything Model (SAM) has attracted widespread attention for its
superior interactive segmentation capabilities with visual prompts while
lacking further exploration of text prompts. In this paper, we empirically
investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting
SAM for referring expression segmentation and introduce the Early
Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective
referring segmentation method which exploits multimodal prompts (i.e., image
and text) and comprises a pre-trained vision-language model to generate
referring prompts and a SAM model for segmentation. Surprisingly, we observe
that: (1) multimodal prompts and (2) vision-language models with early fusion
(e.g., BEIT-3) are beneficial for prompting SAM for accurate referring
segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3
can obtain state-of-the-art performance on RefCOCO/+/g for referring expression
segmentation and demonstrate the superiority of prompting SAM with early
vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters
achieves remarkably higher performance while reducing nearly 82% of parameters
compared to previous SAM methods based on large multimodal models.
comment: Preprint
☆ ASSR-NeRF: Arbitrary-Scale Super-Resolution on Voxel Grid for High-Quality Radiance Fields Reconstruction
NeRF-based methods reconstruct 3D scenes by building a radiance field with
implicit or explicit representations. While NeRF-based methods can perform
novel view synthesis (NVS) at arbitrary scale, the performance in
high-resolution novel view synthesis (HRNVS) with low-resolution (LR)
optimization often results in oversmoothing. On the other hand, single-image
super-resolution (SR) aims to enhance LR images to HR counterparts but lacks
multi-view consistency. To address these challenges, we propose Arbitrary-Scale
Super-Resolution NeRF (ASSR-NeRF), a novel framework for super-resolution novel
view synthesis (SRNVS). We propose an attention-based VoxelGridSR model to
directly perform 3D super-resolution (SR) on the optimized volume. Our model is
trained on diverse scenes to ensure generalizability. For unseen scenes trained
with LR views, we then can directly apply our VoxelGridSR to further refine the
volume and achieve multi-view consistent SR. We demonstrate quantitatively and
qualitatively that the proposed method achieves significant performance in
SRNVS.
☆ SpotlessSplats: Ignoring Distractors in 3D Gaussian Splatting
Sara Sabour, Lily Goli, George Kopanas, Mark Matthews, Dmitry Lagun, Leonidas Guibas, Alec Jacobson, David J. Fleet, Andrea Tagliasacchi
3D Gaussian Splatting (3DGS) is a promising technique for 3D reconstruction,
offering efficient training and rendering speeds, making it suitable for
real-time applications. However, current methods require highly controlled
environments (no moving people or wind-blown elements, and consistent lighting)
to meet the inter-view consistency assumption of 3DGS. This makes
reconstruction of real-world captures problematic. We present SpotlessSplats,
an approach that leverages pre-trained and general-purpose features coupled
with robust optimization to effectively ignore transient distractors. Our
method achieves state-of-the-art reconstruction quality both visually and
quantitatively, on casual captures.
☆ HAITCH: A Framework for Distortion and Motion Correction in Fetal Multi-Shell Diffusion-Weighted MRI
Diffusion magnetic resonance imaging (dMRI) is pivotal for probing the
microstructure of the rapidly-developing fetal brain. However, fetal motion
during scans and its interaction with magnetic field inhomogeneities result in
artifacts and data scattering across spatial and angular domains. The effects
of those artifacts are more pronounced in high-angular resolution fetal dMRI,
where signal-to-noise ratio is very low. Those effects lead to biased estimates
and compromise the consistency and reliability of dMRI analysis. This work
presents HAITCH, the first and the only publicly available tool to correct and
reconstruct multi-shell high-angular resolution fetal dMRI data. HAITCH offers
several technical advances that include a blip-reversed dual-echo acquisition
for dynamic distortion correction, advanced motion correction for model-free
and robust reconstruction, optimized multi-shell design for enhanced
information capture and increased tolerance to motion, and outlier detection
for improved reconstruction fidelity. The framework is open-source, flexible,
and can be used to process any type of fetal dMRI data including single-echo or
single-shell acquisitions, but is most effective when used with multi-shell
multi-echo fetal dMRI data that cannot be processed with any of the existing
tools. Validation experiments on real fetal dMRI scans demonstrate significant
improvements and accurate correction across diverse fetal ages and motion
levels. HAITCH successfully removes artifacts and reconstructs high-fidelity
fetal dMRI data suitable for advanced diffusion modeling, including fiber
orientation distribution function estimation. These advancements pave the way
for more reliable analysis of the fetal brain microstructure and tractography
under challenging imaging conditions.
☆ eMoE-Tracker: Environmental MoE-based Transformer for Robust Event-guided Object Tracking
The unique complementarity of frame-based and event cameras for high frame
rate object tracking has recently inspired some research attempts to develop
multi-modal fusion approaches. However, these methods directly fuse both
modalities and thus ignore the environmental attributes, e.g., motion blur,
illumination variance, occlusion, scale variation, etc. Meanwhile, no
interaction between search and template features makes distinguishing target
objects and backgrounds difficult. As a result, performance degradation is
induced especially in challenging conditions. This paper proposes a novel and
effective Transformer-based event-guided tracking framework, called
eMoE-Tracker, which achieves new SOTA performance under various conditions. Our
key idea is to disentangle the environment into several learnable attributes to
dynamically learn the attribute-specific features for better interaction and
discriminability between the target information and background. To achieve the
goal, we first propose an environmental Mix-of-Experts (eMoE) module that is
built upon the environmental Attributes Disentanglement to learn
attribute-specific features and environmental Attributes Gating to assemble the
attribute-specific features by the learnable attribute scores dynamically. The
eMoE module is a subtle router that fine-tunes the transformer backbone more
efficiently. We then introduce a contrastive relation modeling (CRM) module to
improve interaction and discriminability between the target information and
background. Extensive experiments on diverse event-based benchmark datasets
showcase the superior performance of our eMoE-Tracker compared to prior
art.
comment: RGB-event single object tracking
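The gating idea in the eMoE-Tracker abstract, assembling attribute-specific features by learnable attribute scores, can be sketched as a softmax-weighted sum. This is a generic mixture-of-experts gate under stated assumptions, not the paper's module: the scores are passed in as plain numbers rather than learned, and the name `gate_experts` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_experts(expert_features, attribute_scores):
    """Combine per-attribute feature vectors using softmax weights."""
    weights = softmax(attribute_scores)
    dim = len(expert_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, expert_features))
        for d in range(dim)
    ]


# Two hypothetical attribute experts (e.g. motion blur, low illumination).
features = [[1.0, 0.0], [0.0, 1.0]]
print(gate_experts(features, [0.0, 0.0]))   # equal scores give the plain mean
print(gate_experts(features, [5.0, 0.0]))   # a high score favours expert 0
```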
☆ Malaria Cell Detection Using Deep Neural Networks
Malaria remains one of the most pressing public health concerns globally,
causing significant morbidity and mortality, especially in sub-Saharan Africa.
Rapid and accurate diagnosis is crucial for effective treatment and disease
management. Traditional diagnostic methods, such as microscopic examination of
blood smears, are labor-intensive and require significant expertise, which may
not be readily available in resource-limited settings. This project aims to
automate the detection of malaria-infected cells using a deep learning
approach. We employed a convolutional neural network (CNN) based on the
ResNet50 architecture, leveraging transfer learning to enhance performance. The
Malaria Cell Images Dataset from Kaggle, containing 27,558 images categorized
into infected and uninfected cells, was used for training and evaluation. Our
model demonstrated high accuracy, precision, and recall, indicating its
potential as a reliable tool for assisting in malaria diagnosis. Additionally,
a web application was developed using Streamlit to allow users to upload cell
images and receive predictions about malaria infection, making the technology
accessible and user-friendly. This paper provides a comprehensive overview of
the methodology, experiments, and results, highlighting the effectiveness of
deep learning in medical image analysis.
☆ Wavelets Are All You Need for Autoregressive Image Generation
In this paper, we take a new approach to autoregressive image generation that
is based on two main ingredients. The first is wavelet image coding, which
allows the visual details of an image to be tokenized from coarse to fine
by ordering the information starting with the most significant bits of the most
significant wavelet coefficients. The second is a variant of a language
transformer whose architecture is re-designed and optimized for token sequences
in this 'wavelet language'. The transformer learns the significant statistical
correlations within a token sequence, which are the manifestations of
well-known correlations between the wavelet subbands at various resolutions. We
show experimental results with conditioning on the generation process.
comment: 16 pages, 10 figures
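The coarse-to-fine ordering that wavelet coding provides can be illustrated with a minimal 1D Haar decomposition. This sketch only orders subbands from coarsest to finest. It does not reproduce the bit-plane ordering of the most significant coefficients described in the abstract, and it uses unnormalized averages and differences for readability. The function names are hypothetical.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (coarse) and differences (detail)."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details


def haar_coarse_to_fine(signal):
    """Return [approximation, coarsest details, ..., finest details].

    The signal length is assumed to be a power of two.
    """
    bands = []
    current = list(signal)
    while len(current) > 1:
        current, details = haar_step(current)
        bands.append(details)          # collected from finest to coarsest
    return [current] + bands[::-1]     # reversed so coarse bands come first


print(haar_coarse_to_fine([4, 2, 6, 8]))
```

Reading the bands in this order yields the overall mean first, then progressively finer corrections. That is the property that makes such a sequence usable as a coarse-to-fine token stream.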
☆ STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical
Large Vision-Language Models (LVLMs) have shown significant potential in
assisting medical diagnosis by leveraging extensive biomedical datasets.
However, the advancement of medical image understanding and reasoning
critically depends on building high-quality visual instruction data, which is
costly and labor-intensive to obtain, particularly in the medical domain. To
mitigate this data-starving issue, we introduce Self-Training Large Language
and Vision Assistant for Medical (STLLaVA-Med). The proposed method is designed
to train a policy model (an LVLM) capable of auto-generating medical visual
instruction data to improve data efficiency, guided through Direct Preference
Optimization (DPO). Specifically, a more powerful and larger LVLM (e.g.,
GPT-4o) is involved as a biomedical expert to oversee the DPO fine-tuning
process on the auto-generated data, encouraging the policy model to align
efficiently with human preferences. We validate the efficacy and data
efficiency of STLLaVA-Med across three major medical Visual Question Answering
(VQA) benchmarks, demonstrating competitive zero-shot performance with the
utilization of only 9% of the medical data.
comment: 10 pages, 6 figures
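For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO objective can be sketched as below. This is the generic DPO loss, not STLLaVA-Med's training code. The log-probabilities are passed in as plain floats, and the function name and the default `beta=0.1` are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * margin)), where margin is the policy's
    log-ratio gain on the chosen response minus that on the rejected one,
    each measured against a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
# Policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```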
☆ Impact of Initialization on Intra-subject Pediatric Brain MR Image Registration: A Comparative Analysis between SyN ANTs and Deep Learning-Based Approaches
This study evaluates the performance of conventional SyN ANTs and
learning-based registration methods in the context of pediatric neuroimaging,
specifically focusing on intrasubject deformable registration. The comparison
involves three approaches: without (NR), with rigid (RR), and with rigid and
affine (RAR) initializations. In addition to initialization, performances are
evaluated in terms of accuracy, speed, and the impact of age intervals and sex
per pair. Data consists of the publicly available MRI scans from the Calgary
Preschool dataset, which includes 63 children aged 2-7 years, allowing for 431
registration pairs. We implemented the unsupervised DL framework with a U-Net
architecture using DeepReg and evaluated it with 5-fold cross-validation. Evaluation
includes Dice scores for tissue segmentation from 18 smaller regions obtained
by SynthSeg, analysis of log Jacobian determinants, and registration pro-rated
training and inference times. Learning-based approaches, with or without linear
initializations, exhibit slight superiority over SyN ANTs in terms of Dice
scores. Indeed, DL-based implementations with RR and RAR initializations
significantly outperform SyN ANTs. Both SyN ANTs and DL-based registration
involve parameter optimization, but the choice between these methods depends on
the scale of registration: network-based for broader coverage or SyN ANTs for
specific structures. Both methods face challenges with larger age intervals due
to greater growth changes. The main takeaway is that while DL-based methods
show promise with faster and more accurate registrations, SyN ANTs remains
robust and generalizable without the need for extensive training, highlighting
the importance of method selection based on specific registration needs in the
pediatric context. Our code is available at
https://github.com/neuropoly/pediatric-DL-registration
comment: Accepted for publication at the Journal of Machine Learning for
Biomedical Imaging (MELBA) https://melba-journal.org/2024:013
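The Dice score used for evaluation in the abstract above measures overlap between two segmentation masks: twice the intersection divided by the sum of the mask sizes. A minimal sketch on flattened binary masks follows. The function name is hypothetical, and returning 1.0 for two empty masks is a convention chosen here, not something the abstract states.

```python
def dice_score(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


fixed_labels = [1, 1, 0, 0]
warped_labels = [1, 0, 0, 0]
print(dice_score(fixed_labels, warped_labels))  # 2 * 1 / (2 + 1)
```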
☆ GRACE: Graph-Regularized Attentive Convolutional Entanglement with Laplacian Smoothing for Robust DeepFake Video Detection
As DeepFake video manipulation techniques escalate, posing profound threats,
the urgent need to develop efficient detection strategies is underscored.
However, one particular issue lies with facial images being mis-detected, often
originating from degraded videos or adversarial attacks, leading to unexpected
temporal artifacts that can undermine the efficacy of DeepFake video detection
techniques. This paper introduces a novel method for robust DeepFake video
detection, harnessing the power of the proposed Graph-Regularized Attentive
Convolutional Entanglement (GRACE) based on the graph convolutional network
with graph Laplacian to address the aforementioned challenges. First,
conventional Convolutional Neural Networks are deployed to extract spatiotemporal
features for the entire video. Then, the spatial and temporal features are
mutually entangled by constructing a graph with a sparse constraint, ensuring
that essential features of valid face images in the noisy face sequences are
retained, thus augmenting stability and performance for DeepFake video detection.
Furthermore, the Graph Laplacian prior is proposed in the graph convolutional
network to remove the noise pattern in the feature space to further improve the
performance. Comprehensive experiments are conducted to illustrate that our
proposed method delivers state-of-the-art performance in DeepFake video
detection under noisy face sequences. The source code is available at
https://github.com/ming053l/GRACE.
comment: Submitted to TPAMI 2024
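The effect of a graph Laplacian on noisy features can be illustrated with one explicit smoothing step, x' = x - alpha * L x, where L = D - A. This is a generic sketch of Laplacian smoothing, not the GRACE architecture: the chain graph over three frames, the scalar features, and the step size `alpha` are all illustrative assumptions.

```python
def graph_laplacian(adjacency):
    """Combinatorial Laplacian L = D - A for an undirected graph."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def laplacian_smooth(features, adjacency, alpha):
    """One smoothing step: x' = x - alpha * L x."""
    lap = graph_laplacian(adjacency)
    n = len(features)
    return [
        features[i] - alpha * sum(lap[i][j] * features[j] for j in range(n))
        for i in range(n)
    ]


# Three consecutive frames connected in a chain; the middle one is an outlier.
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
noisy = [1.0, 5.0, 1.0]
print(laplacian_smooth(noisy, adjacency, alpha=0.25))
```

The outlier is pulled toward its neighbours while the sum of the features stays the same. This is the intuition behind using a Laplacian term to suppress noise in feature space.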
☆ Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping
Large parallax between images is an intractable issue in image stitching.
Various warping-based methods are proposed to address it, yet the results are
unsatisfactory. In this paper, we propose a novel image stitching method using
multi-homography warping guided by image segmentation. Specifically, we
leverage the Segment Anything Model to segment the target image into numerous
contents and partition the feature points into multiple subsets via the
energy-based multi-homography fitting algorithm. The multiple subsets of
feature points are used to calculate the corresponding multiple homographies.
For each segmented content in the overlapping region, we select its
best-fitting homography with the lowest photometric error. For each segmented
content in the non-overlapping region, we calculate a weighted combination of
the linearized homographies. Finally, the target image is warped via the
best-fitting homographies to align with the reference image, and the final
panorama is generated via linear blending. Comprehensive experimental results
on the public datasets demonstrate that our method provides the best alignment
accuracy by a large margin, compared with the state-of-the-art methods. The
source code is available at https://github.com/tlliao/multi-homo-warp.
comment: 11 pages, 9 figures
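The per-segment selection rule in the stitching abstract, picking the candidate homography with the lowest photometric error, can be sketched as follows. Images are small 2D lists of intensities, sampling is nearest-neighbour, and the error is a mean absolute difference. These choices and all function names are assumptions for illustration, not the authors' implementation.

```python
def apply_homography(H, x, y):
    """Map pixel (x, y) through a 3x3 homography given as nested lists."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return u, v


def photometric_error(H, segment_pixels, target, reference):
    """Mean absolute intensity difference over the segment's warped pixels."""
    errors = []
    for x, y in segment_pixels:
        u, v = apply_homography(H, x, y)
        col, row = round(u), round(v)
        # Skip pixels that land outside the reference image.
        if 0 <= row < len(reference) and 0 <= col < len(reference[0]):
            errors.append(abs(target[y][x] - reference[row][col]))
    return sum(errors) / len(errors) if errors else float("inf")


def best_homography(candidates, segment_pixels, target, reference):
    """Index of the candidate homography with the lowest photometric error."""
    return min(
        range(len(candidates)),
        key=lambda i: photometric_error(candidates[i], segment_pixels, target, reference),
    )


# Toy data: the reference is the target shifted one pixel to the right.
target = [[1, 2, 3, 0], [4, 5, 6, 0]]
reference = [[0, 1, 2, 3], [0, 4, 5, 6]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shift_right = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]
segment = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
print(best_homography([identity, shift_right], segment, target, reference))
```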
☆ Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
The Mixture-of-Experts (MoE) has gained increasing attention in the study of
Large Vision-Language Models (LVLMs). It uses a sparse model to replace the
dense model, achieving comparable performance while activating fewer parameters
during inference, thus significantly reducing the inference cost. Existing MoE
methods in LVLMs encourage different experts to handle different tokens, and
thus they employ a router to predict the routing for each token. However, the
predictions are based solely on sample features and do not truly reveal the
optimization direction of tokens. This can lead to severe optimization
conflicts between different tokens within an expert. To address this problem,
this paper proposes a novel method based on token-level gradient analysis.
Specifically, we first use token-level gradients to identify conflicting tokens
in experts. Then, we add a specialized loss tailored to eliminate conflicts
among tokens within each expert. Our method can serve as a plug-in for diverse
Large Vision-Language Models, and extensive experimental results demonstrate
the effectiveness of our method. The code will be publicly available at
https://github.com/longrongyang/STGC.
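One way to picture how token-level gradients might flag conflicting tokens is to compare each token's gradient with the average gradient of the tokens routed to the same expert. The abstract does not give the exact criterion, so the rule below (negative cosine similarity with the mean gradient) is an assumption, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def conflicting_tokens(token_grads):
    """Indices of tokens whose gradient opposes the expert's mean gradient.

    Assumed criterion: cosine similarity with the mean gradient below zero.
    """
    dim = len(token_grads[0])
    mean_grad = [sum(g[d] for g in token_grads) / len(token_grads) for d in range(dim)]
    return [i for i, g in enumerate(token_grads) if cosine(g, mean_grad) < 0.0]


# Toy gradients for three tokens routed to one expert (illustrative values).
grads = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]]
print(conflicting_tokens(grads))
```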
☆ On the Value of PHH3 for Mitotic Figure Detection on H&E-stained Images
Jonathan Ganz, Christian Marzahl, Jonas Ammeling, Barbara Richter, Chloé Puget, Daniela Denk, Elena A. Demeter, Flaviu A. Tabaran, Gabriel Wasinger, Karoline Lipnik, Marco Tecilla, Matthew J. Valentine, Michael J. Dark, Niklas Abele, Pompei Bolfa, Ramona Erber, Robert Klopfleisch, Sophie Merz, Taryn A. Donovan, Samir Jabari, Christof A. Bertram, Katharina Breininger, Marc Aubreville
The count of mitotic figures (MFs) observed in hematoxylin and eosin
(H&E)-stained slides is an important prognostic marker as it is a measure for
tumor cell proliferation. However, the identification of MFs has a known low
inter-rater agreement. Deep learning algorithms can standardize this task, but
they require large amounts of annotated data for training and validation.
Furthermore, label noise introduced during the annotation process may impede
the algorithm's performance. Unlike H&E, the mitosis-specific antibody
phospho-histone H3 (PHH3) specifically highlights MFs. Counting MFs on slides
stained against PHH3 leads to higher agreement among raters and has therefore
recently been used as a ground truth for the annotation of MFs in H&E. However,
as PHH3 facilitates the recognition of cells indistinguishable from H&E stain
alone, the use of this ground truth could potentially introduce noise into the
H&E-related dataset, impacting model performance. This study analyzes the
impact of PHH3-assisted MF annotation on inter-rater reliability and
object-level agreement through an extensive multi-rater experiment. We found that the
annotators' object-level agreement increased when using PHH3-assisted labeling.
Subsequently, MF detectors were evaluated on the resulting datasets to
investigate the influence of PHH3-assisted labeling on the models' performance.
Additionally, a novel dual-stain MF detector was developed to investigate the
interpretation-shift of PHH3-assisted labels used in H&E, which clearly
outperformed single-stain detectors. However, the PHH3-assisted labels did not
have a positive effect on solely H&E-based models. The high performance of our
dual-input detector reveals an information mismatch between the H&E and
PHH3-stained images as the cause of this effect.
comment: 10 pages, 5 figures, 1 Table
☆ InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding
Understanding long videos, ranging from tens of minutes to several hours,
presents unique challenges in video comprehension. Despite the increasing
importance of long-form video content, existing benchmarks primarily focus on
shorter clips. To address this gap, we introduce InfiniBench, a comprehensive
benchmark for very long video understanding which presents 1) the longest video
duration, averaging 76.34 minutes; 2) the largest number of question-answer
pairs, 108.2K; 3) diversity in questions that examine nine different skills and
include both multiple-choice questions and open-ended questions; 4)
human-centric design, as the video sources come from movies and daily TV shows, with
specific human-level question designs such as Movie Spoiler Questions that
require critical thinking and comprehensive understanding. Using InfiniBench,
we comprehensively evaluate existing Large MultiModality Models (LMMs) on each
skill, including the commercial model Gemini 1.5 Flash and the open-source
models. The evaluation shows significant challenges in our benchmark. Our
results show that the best AI models, such as Gemini, struggle to perform well, with
42.72% average accuracy and 2.71 out of 5 average score. We hope this benchmark
will stimulate the LMMs community towards long video and human-level
understanding. Our benchmark can be accessed at
https://vision-cair.github.io/InfiniBench/
comment: 16 pages, 17 figures
☆ FootBots: A Transformer-based Architecture for Motion Prediction in Soccer ICIP 2024
Motion prediction in soccer involves capturing complex dynamics from player
and ball interactions. We present FootBots, an encoder-decoder
transformer-based architecture addressing motion prediction and conditioned
motion prediction through equivariance properties. FootBots captures temporal
and social dynamics using set attention blocks and a multi-attention block
decoder. Our evaluation utilizes two datasets: a real soccer dataset and a
tailored synthetic one. Insights from the synthetic dataset highlight the
effectiveness of FootBots' social attention mechanism and the significance of
conditioned motion prediction. Empirical results on real soccer data
demonstrate that FootBots outperforms baselines in motion prediction and excels
in conditioned tasks, such as predicting the players based on the ball
position, predicting the offensive (defensive) team based on the ball and the
defensive (offensive) team, and predicting the ball position based on all
players. Our evaluation connects quantitative and qualitative findings.
https://youtu.be/9kaEkfzG3L8
comment: Published as a conference paper at IEEE ICIP 2024
☆ StreamMOTP: Streaming and Unified Framework for Joint 3D Multi-Object Tracking and Trajectory Prediction
3D multi-object tracking and trajectory prediction are two crucial modules in
autonomous driving systems. Generally, the two tasks are handled separately in
traditional paradigms and a few methods have started to explore modeling these
two tasks in a joint manner recently. However, these approaches suffer from the
limitations of single-frame training and inconsistent coordinate
representations between tracking and prediction tasks. In this paper, we
propose a streaming and unified framework for joint 3D Multi-Object Tracking
and trajectory Prediction (StreamMOTP) to address the above challenges.
Firstly, we construct the model in a streaming manner and exploit a memory bank
to preserve and leverage the long-term latent features for tracked objects more
effectively. Secondly, a relative spatio-temporal positional encoding strategy
is introduced to bridge the gap of coordinate representations between the two
tasks and maintain the pose-invariance for trajectory prediction. Thirdly, we
further improve the quality and consistency of predicted trajectories with a
dual-stream predictor. We conduct extensive experiments on the popular nuScenes
dataset and the experimental results demonstrate the effectiveness and
superiority of StreamMOTP, which outperforms previous methods significantly on
both tasks. Furthermore, we also prove that the proposed framework has great
potential and advantages in actual applications of autonomous driving.
☆ LightStereo: Channel Boost Is All Your Need for Efficient 2D Cost Aggregation
We present LightStereo, a cutting-edge stereo-matching network crafted to
accelerate the matching process. Departing from conventional methodologies that
rely on aggregating computationally intensive 4D costs, LightStereo adopts the
3D cost volume as a lightweight alternative. While similar approaches have been
explored previously, our breakthrough lies in enhancing performance through a
dedicated focus on the channel dimension of the 3D cost volume, where the
distribution of matching costs is encapsulated. Our exhaustive exploration has
yielded plenty of strategies to amplify the capacity of the pivotal dimension,
ensuring both precision and efficiency. We compare the proposed LightStereo
with existing state-of-the-art methods across various benchmarks, which
demonstrate its superior performance in speed, accuracy, and resource
utilization. LightStereo achieves a competitive EPE metric in the SceneFlow
datasets while demanding a minimum of only 22 GFLOPs, with an inference time of
just 17 ms. Our comprehensive analysis reveals the effect of 2D cost
aggregation for stereo matching, paving the way for real-world applications of
efficient stereo systems. Code will be available at
https://github.com/XiandaGuo/OpenStereo.
comment: Code will be available at
https://github.com/XiandaGuo/OpenStereo
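To make the 3D cost volume mentioned above concrete, the sketch below builds one image row of a correlation-based volume, indexed as cost[disparity][x]. Stacking such rows over the image height gives a D x H x W volume whose disparity axis acts as the channel dimension. The correlation construction and all names are assumptions for illustration; the abstract does not say how LightStereo builds its volume.

```python
def correlation_cost_row(left, right, max_disp):
    """Correlation cost for one image row.

    left, right: feature rows indexed as [channel][x].
    Returns cost[d][x], the mean over channels of left[c][x] * right[c][x - d].
    Positions with x < d have no valid match and stay at 0.0.
    """
    channels = len(left)
    width = len(left[0])
    cost = [[0.0] * width for _ in range(max_disp)]
    for d in range(max_disp):
        for x in range(d, width):
            # Left pixel x is compared with right pixel x - d.
            total = sum(left[c][x] * right[c][x - d] for c in range(channels))
            cost[d][x] = total / channels
    return cost

left = [[0.0, 0.0, 1.0, 0.0]]   # one channel, feature peak at x = 2
right = [[0.0, 1.0, 0.0, 0.0]]  # same peak at x = 1, i.e. a disparity of 1
cost = correlation_cost_row(left, right, max_disp=3)
best = max(range(3), key=lambda d: cost[d][2])
```

For the left pixel at x = 2 the strongest response lies at disparity 1, which matches how the toy inputs were shifted.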
☆ Emotion Loss Attacking: Adversarial Attack Perception for Skeleton based on Multi-dimensional Features
Adversarial attack on skeletal motion is a hot topic. However, existing
research considers only part of the dynamic features when measuring the distance
between skeleton graph sequences, which results in poor imperceptibility. To
this end, we propose a novel adversarial attack method to attack action
recognizers for skeletal motions. Firstly, our method systematically proposes a
dynamic distance function to measure the difference between skeletal motions.
Meanwhile, we innovatively introduce emotional features for complementary
information. In addition, we use the Alternating Direction Method of
Multipliers (ADMM) to solve the constrained optimization problem, which
generates adversarial samples with better imperceptibility to deceive the
classifiers. Experiments show that our method is effective on multiple action
classifiers and datasets. When the perturbation magnitude measured by l norms
is the same, the dynamic perturbations generated by our method are much lower
than those of other methods. Moreover, we are the first to prove the
effectiveness of emotional features, and provide a new idea for measuring the
distance between skeletal motions.
☆ Extract More from Less: Efficient Fine-Grained Visual Recognition in Low-Data Regimes
The emerging task of fine-grained image classification in low-data regimes
assumes the presence of low inter-class variance and large intra-class
variation along with a highly limited amount of training samples per class.
However, traditional ways of separately dealing with fine-grained
categorisation and extremely scarce data may be inefficient under both these
harsh conditions presented together. In this paper, we present a novel
framework, called AD-Net, aiming to enhance deep neural network performance on
this challenge by leveraging the power of Augmentation and Distillation
techniques. Specifically, our approach is designed to refine learned features
through self-distillation on augmented samples, mitigating harmful overfitting.
We conduct comprehensive experiments on popular fine-grained image
classification benchmarks where our AD-Net demonstrates consistent improvement
over traditional fine-tuning and state-of-the-art low-data techniques.
Remarkably, with the smallest data available, our framework shows an
outstanding relative accuracy increase of up to 45% compared to standard
ResNet-50 and up to 27% compared to the closest SOTA runner-up. We emphasise
that our approach is practically architecture-independent and adds zero extra
cost at inference time. Additionally, we provide an extensive study on the
impact of each component of the framework, highlighting the importance of each in
achieving optimal performance. Source code and trained models are publicly
available at github.com/demidovd98/fgic_lowd.
comment: Main paper and Appendices
☆ EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting
Daiwei Zhang, Gengyan Li, Jiajie Li, Mickaël Bressieux, Otmar Hilliges, Marc Pollefeys, Luc Van Gool, Xi Wang
Human activities are inherently complex, and even simple household tasks
involve numerous object interactions. To better understand these activities and
behaviors, it is crucial to model their dynamic interactions with the
environment. The recent availability of affordable head-mounted cameras and
egocentric data offers a more accessible and efficient means to understand
dynamic human-object interactions in 3D environments. However, most existing
methods for human activity modeling either focus on reconstructing 3D models of
hand-object or human-scene interactions or on mapping 3D scenes, neglecting
dynamic interactions with objects. The few existing solutions often require
inputs from multiple sources, including multi-camera setups, depth-sensing
cameras, or kinesthetic sensors. To this end, we introduce EgoGaussian, the
first method capable of simultaneously reconstructing 3D scenes and dynamically
tracking 3D object motion from RGB egocentric input alone. We leverage the
uniquely discrete nature of Gaussian Splatting and segment dynamic interactions
from the background. Our approach employs a clip-level online learning pipeline
that leverages the dynamic nature of human activities, allowing us to
reconstruct the temporal evolution of the scene in chronological order and
track rigid object motion. Additionally, our method automatically segments
object and background Gaussians, providing 3D representations for both static
scenes and dynamic objects. EgoGaussian outperforms previous NeRF and Dynamic
Gaussian methods in challenging in-the-wild videos and we also qualitatively
demonstrate the high quality of the reconstructed models.
☆ Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting MICCAI24
Generalist segmentation models are increasingly favored for diverse tasks
involving various objects from different image sources. Task-Incremental
Learning (TIL) offers a privacy-preserving training paradigm using tasks
arriving sequentially, instead of gathering them due to strict data sharing
policies. However, the task evolution can span a wide scope that involves
shifts in both image appearance and segmentation semantics with intricate
correlation, causing concurrent appearance and semantic forgetting. To solve
this issue, we propose a Comprehensive Generative Replay (CGR) framework that
restores appearance and semantic knowledge by synthesizing image-mask pairs to
mimic past task data, which focuses on two aspects: modeling image-mask
correspondence and promoting scalability for diverse tasks. Specifically, we
introduce a novel Bayesian Joint Diffusion (BJD) model for high-quality
synthesis of image-mask pairs with their correspondence explicitly preserved by
conditional denoising. Furthermore, we develop a Task-Oriented Adapter (TOA)
that recalibrates prompt embeddings to modulate the diffusion model, making the
data synthesis compatible with different tasks. Experiments on incremental
tasks (cardiac, fundus and prostate segmentation) show its clear advantage for
alleviating concurrent appearance and semantic forgetting. Code is available at
https://github.com/jingyzhang/CGR.
comment: Accepted by MICCAI24
☆ Structure-aware World Model for Probe Guidance via Large-scale Self-supervised Pre-train
The complex structure of the heart leads to significant challenges in
echocardiography, especially in acquiring cardiac ultrasound images.
Successful echocardiography requires a thorough understanding of the structures
on the two-dimensional plane and the spatial relationships between planes in
three-dimensional space. In this paper, we innovatively propose a large-scale
self-supervised pre-training method to acquire a cardiac structure-aware world
model. The core innovation lies in constructing a self-supervised task that
requires structural inference by predicting masked structures on a 2D plane and
imagining another plane based on pose transformation in 3D space. To support
large-scale pre-training, we collected over 1.36 million echocardiograms from
ten standard views, along with their 3D spatial poses. In the downstream probe
guidance task, we demonstrate that our pre-trained model consistently reduces
guidance errors across the ten most common standard views on the test set with
0.29 million samples from 74 routine clinical scans, indicating that
structure-aware pre-training benefits the scanning.
comment: Technical report
☆ SPIRONet: Spatial-Frequency Learning and Topological Channel Interaction Network for Vessel Segmentation
De-Xing Huang, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Mei-Jiang Gui, Hao Li, Tian-Yu Xiang, Bo-Xian Yao, Zeng-Guang Hou
Automatic vessel segmentation is paramount for developing next-generation
interventional navigation systems. However, current approaches suffer from
suboptimal segmentation performances due to significant challenges in
intraoperative images (i.e., low signal-to-noise ratio, small or slender
vessels, and strong interference). In this paper, a novel spatial-frequency
learning and topological channel interaction network (SPIRONet) is proposed to
address the above issues. Specifically, dual encoders are utilized to
comprehensively capture local spatial and global frequency vessel features.
Then, a cross-attention fusion module is introduced to effectively fuse spatial
and frequency features, thereby enhancing feature discriminability.
Furthermore, a topological channel interaction module is designed to filter out
task-irrelevant responses based on graph neural networks. Extensive
experimental results on several challenging datasets (CADSA, CAXF, DCA1, and
XCAD) demonstrate state-of-the-art performances of our method. Moreover, the
inference speed of SPIRONet is 21 FPS with a 512x512 input size, surpassing
clinical real-time requirements (6~12 FPS). These promising outcomes indicate
SPIRONet's potential for integration into vascular interventional navigation
systems. Code is available at https://github.com/Dxhuang-CASIA/SPIRONet.
☆ MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
This paper introduces MM-Instruct, a large-scale dataset of diverse and
high-quality visual instruction data designed to enhance the
instruction-following capabilities of large multimodal models (LMMs). While
existing visual instruction datasets often focus on question-answering, they
struggle to generalize to broader application scenarios such as creative
writing, summarization, or image analysis. To address these limitations, we
propose a novel approach to constructing MM-Instruct that leverages the strong
instruction-following capabilities of existing LLMs to generate novel visual
instruction data from large-scale but conventional image captioning datasets.
MM-Instruct first leverages ChatGPT to automatically generate diverse
instructions from a small set of seed instructions through augmenting and
summarization. It then matches these instructions with images and uses an
open-sourced large language model (LLM) to generate coherent answers to the
instruction-image pairs. The LLM is grounded by the detailed text descriptions
of images in the whole answer generation process to guarantee the alignment of
the instruction data. Moreover, we introduce a benchmark based on the generated
instruction data to evaluate the instruction-following capabilities of existing
LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5
model on the generated data, denoted as LLaVA-Instruct, which exhibits
significant improvements in instruction-following capabilities compared to
LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models
are available at https://github.com/jihaonew/MM-Instruct.
comment: Dataset and models are available at
https://github.com/jihaonew/MM-Instruct
☆ EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans
Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of
human joints from a single 2D image captured by a camera. However, a single 2D
point in the image may correspond to multiple points in 3D space. Typically,
the uniqueness of the 2D-3D relationship is approximated using an orthographic
or weak-perspective camera model. In this study, instead of relying on
approximations, we advocate for utilizing the full perspective camera model.
This involves estimating camera parameters and establishing a precise,
unambiguous 2D-3D relationship. To do so, we introduce the EPOCH framework,
comprising two main components: the pose lifter network (LiftNet) and the pose
regressor network (RegNet). LiftNet utilizes the full perspective camera model
to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose
and camera parameters as inputs and produces the corresponding 3D pose
estimation. These inputs are obtained from RegNet, which starts from a single
image and provides estimates for the 2D pose and camera parameters. RegNet
utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a
3D pose, which is then projected to 2D using the estimated camera parameters.
This process enables RegNet to establish the unambiguous 2D-3D relationship.
Our experiments show that modeling the lifting as an unsupervised task with a
camera in-the-loop results in better generalization to unseen data. We obtain
state-of-the-art results for the 3D HPE on the Human3.6M and MPI-INF-3DHP
datasets. Our code is available at: [Github link upon acceptance, see
supplementary materials].
comment: 17 pages, 7 figures
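The 2D-3D ambiguity and the difference between camera models discussed above can be seen in a short sketch. It uses the textbook pinhole equations u = f X / Z + cx and v = f Y / Z + cy; the focal length, principal point, and function names are illustrative assumptions and are not taken from the EPOCH code.

```python
def project_perspective(point, f, cx, cy):
    """Full perspective projection: each point is divided by its own depth."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)

def project_weak_perspective(point, f, cx, cy, z_ref):
    """Weak perspective: every point shares one reference depth z_ref."""
    x, y, _ = point
    return (f * x / z_ref + cx, f * y / z_ref + cy)

near = (0.5, 0.25, 2.0)
far = (1.0, 0.5, 4.0)  # same viewing ray as `near`, twice as far away

p_near = project_perspective(near, 1000.0, 500.0, 500.0)
p_far = project_perspective(far, 1000.0, 500.0, 500.0)
w_far = project_weak_perspective(far, 1000.0, 500.0, 500.0, z_ref=2.0)
```

Both 3D points land on the same pixel under full perspective, which is the ambiguity that known camera parameters help resolve, while the weak-perspective approximation places the far point elsewhere because it ignores that point's own depth.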
☆ Vision Transformer with Key-select Routing Attention for Single Image Dehazing
We present Ksformer, utilizing Multi-scale Key-select Routing Attention
(MKRA) for intelligent selection of key areas through multi-channel,
multi-scale windows with a top-k operator, and Lightweight Frequency Processing
Module (LFPM) to enhance high-frequency features, outperforming other dehazing
methods in tests.
comment: 5 pages, 4 figures, IEICE Trans. Information and Systems
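A top-k operator of the kind mentioned above can be sketched as attention that keeps only the k highest-scoring keys for a query and renormalises over them. This is generic top-k sparse attention written for illustration; the function name and the single-query setting are assumptions, and it does not reproduce the multi-scale windowing of MKRA.

```python
import math

def topk_attention_weights(scores, k):
    """Softmax restricted to the k largest scores; other keys get weight 0.0."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in keep)
    # Subtract the peak before exponentiating for numerical stability.
    exps = {i: math.exp(scores[i] - peak) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]

# Query-key similarity scores for four candidate keys.
weights = topk_attention_weights([2.0, -1.0, 2.0, 0.5], k=2)
```

Here the two keys with score 2.0 share the attention equally and the remaining keys are ignored.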
☆ MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?
Jinming Li, Yichen Zhu, Zhiyuan Xu, Jindong Gu, Minjie Zhu, Xin Liu, Ning Liu, Yaxin Peng, Feifei Feng, Jian Tang
It is fundamentally challenging for robots to serve as useful assistants in
human environments because this requires addressing a spectrum of sub-problems
across robotics, including perception, language understanding, reasoning, and
planning. The recent advancements in Multimodal Large Language Models (MLLMs)
have demonstrated their exceptional abilities in solving complex mathematical
problems, mastering commonsense and abstract reasoning. This has led to the
recent utilization of MLLMs as the brain in robotic systems, enabling these
models to conduct high-level planning prior to triggering low-level control
actions for task execution. However, it remains uncertain whether existing
MLLMs are reliable in serving the brain role of robots. In this study, we
introduce the Multimodal LLM for Robotic (MMRo) benchmark, the first benchmark
that tests the capability of MLLMs for robot applications.
Specifically, we identify four essential capabilities (perception, task
planning, visual reasoning, and safety measurement) that MLLMs must possess to
qualify as the robot's central processing unit. We have developed several
scenarios for each capability, resulting in a total of 14 metrics for
evaluation. We present experimental results for various MLLMs, including both
commercial and open-source models, to assess the performance of existing
systems. Our findings indicate that no single model excels in all areas,
suggesting that current MLLMs are not yet trustworthy enough to serve as the
cognitive core for robots. Our data can be found at
https://mm-robobench.github.io/.
☆ Deep Fusion Model for Brain Tumor Classification Using Fine-Grained Gradient Preservation
Niful Islam, Mohaiminul Islam Bhuiyan, Jarin Tasnim Raya, Nur Shazwani Kamarudin, Khan Md Hasib, M. F. Mridha, Dewan Md. Farid
Brain tumors are one of the most common diseases that lead to early death if
not diagnosed at an early stage. Traditional diagnostic approaches are
extremely time-consuming and prone to errors. In this context, computer
vision-based approaches have emerged as an effective tool for accurate brain
tumor classification. While some of the existing solutions demonstrate
noteworthy accuracy, the models become infeasible to deploy in areas where
computational resources are limited. This research addresses the need for
accurate and fast classification of brain tumors with a priority of deploying
the model in technologically underdeveloped regions. The research presents a
novel architecture for precise brain tumor classification fusing pretrained
ResNet152V2 and modified VGG16 models. The proposed architecture undergoes a
diligent fine-tuning process that ensures fine gradients are preserved in deep
neural networks, which are essential for effective brain tumor classification.
The proposed solution incorporates various image processing techniques to
improve image quality and achieves an astounding accuracy of 98.36% and 98.04%
in Figshare and Kaggle datasets respectively. This architecture stands out for
having a streamlined profile, with only 2.8 million trainable parameters. We
have leveraged 8-bit quantization to produce a model of size 73.881 MB,
significantly reducing it from the previous size of 289.45 MB, ensuring smooth
deployment in edge devices even in resource-constrained areas. Additionally,
the use of Grad-CAM improves the interpretability of the model, offering
insightful information regarding its decision-making process. Owing to its high
discriminative ability, this model can be a reliable option for accurate brain
tumor classification.
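The reported model sizes can be checked with a few lines of arithmetic. The sketch below uses only the two sizes quoted above; the expectation of roughly a fourfold reduction assumes the original weights were stored as 32-bit floats, which the abstract does not state.

```python
size_before_mb = 289.45   # reported size before quantization
size_after_mb = 73.881    # reported size after 8-bit quantization

compression_ratio = size_before_mb / size_after_mb
relative_reduction = 1.0 - size_after_mb / size_before_mb

# Going from 32-bit to 8-bit storage would give a ratio of at most 4;
# parts of a model that stay in higher precision pull the ratio below that.
ideal_ratio = 32 / 8
```

The ratio comes out slightly below 4 and the size shrinks by roughly three quarters, consistent with most of the weights moving to 8 bits.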
☆ Enhancing Radiological Diagnosis: A Collaborative Approach Integrating AI and Human Expertise for Visual Miss Correction
Human-AI collaboration to identify and correct perceptual errors in chest
radiographs has not been previously explored. This study aimed to develop a
collaborative AI system, CoRaX, which integrates eye gaze data and radiology
reports to enhance diagnostic accuracy in chest radiology by pinpointing
perceptual errors and refining the decision-making process. Using public
datasets REFLACX and EGD-CXR, the study retrospectively developed CoRaX,
employing a large multimodal model to analyze image embeddings, eye gaze data,
and radiology reports. The system's effectiveness was evaluated based on its
referral-making process, the quality of referrals, and performance in
collaborative diagnostic settings. CoRaX was tested on a simulated error
dataset of 271 samples with 28% (93 of 332) missed abnormalities. The system
corrected 21% (71 of 332) of these errors, leaving 7% (22 of 332) unresolved.
The Referral-Usefulness score, indicating the accuracy of predicted regions for
all true referrals, was 0.63 (95% CI 0.59, 0.68). The Total-Usefulness score,
reflecting the diagnostic accuracy of CoRaX's interactions with radiologists,
showed that 84% (237 of 280) of these interactions had a score above 0.40. In
conclusion, CoRaX efficiently collaborates with radiologists to address
perceptual errors across various abnormalities, with potential applications in
the education and training of novice radiologists.
comment: Under Review in Journal
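The error counts reported above can be checked with simple arithmetic. The sketch assumes that all three percentages share the denominator 332; the variable and function names are illustrative.

```python
def percent(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)

total_abnormalities = 332
missed = 93
corrected = 71
unresolved = missed - corrected  # errors the system did not correct

missed_pct = percent(missed, total_abnormalities)
corrected_pct = percent(corrected, total_abnormalities)
unresolved_pct = percent(unresolved, total_abnormalities)
```

Under that assumption the missed, corrected, and unresolved counts are 93, 71, and 22, and their rounded shares are consistent with the 28%, 21%, and 7% quoted in the abstract.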
☆ MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
In recent years, generative artificial intelligence has achieved significant
advancements in the field of image generation, spawning a variety of
applications. However, video generation still faces considerable challenges in
various aspects, such as controllability, video length, and richness of
details, which hinder the application and popularization of this technology. In
this work, we propose a controllable video generation framework, dubbed
MimicMotion, which can generate high-quality videos of arbitrary length
mimicking specific motion guidance. Compared with previous methods, our
approach has several highlights. Firstly, we introduce confidence-aware pose
guidance that ensures high frame quality and temporal smoothness. Secondly, we
introduce regional loss amplification based on pose confidence, which
significantly reduces image distortion. Lastly, for generating long and smooth
videos, we propose a progressive latent fusion strategy. By this means, we can
produce videos of arbitrary length with acceptable resource consumption. With
extensive experiments and user studies, MimicMotion demonstrates significant
improvements over previous approaches in various aspects. Detailed results and
comparisons are available on our project page:
https://tencent.github.io/MimicMotion .
☆ Deep Learning-based Depth Estimation Methods from Monocular Image and Videos: A Comprehensive Survey
Estimating depth from single RGB images and videos is of widespread interest
due to its applications in many areas, including autonomous driving, 3D
reconstruction, digital entertainment, and robotics. More than 500 deep
learning-based papers have been published in the past 10 years, which indicates
the growing interest in the task. This paper presents a comprehensive survey of
the existing deep learning-based methods, the challenges they address, and how
they have evolved in their architecture and supervision methods. It provides a
taxonomy for classifying the current work based on their input and output
modalities, network architectures, and learning methods. It also discusses the
major milestones in the history of monocular depth estimation, and different
pipelines, datasets, and evaluation metrics used in existing methods.
comment: 46 pages, 10 figures, The paper has been accepted for publication in
ACM Computing Surveys 2024
☆ Beyond First-Order: A Multi-Scale Approach to Finger Knuckle Print Biometrics
Recently, finger knuckle prints (FKPs) have gained attention due to their
rich textural patterns, positioning them as a promising biometric for identity
recognition. Prior FKP recognition methods predominantly leverage first-order
feature descriptors, which capture intricate texture details but fail to
account for structural information. Emerging research, however, indicates that
second-order textures, which describe the curves and arcs of the textures,
encompass this overlooked structural information. This paper introduces a novel
FKP recognition approach, the Dual-Order Texture Competition Network (DOTCNet),
designed to capture texture information in FKP images comprehensively. DOTCNet
incorporates three dual-order texture competitive modules (DTCMs), each
targeting textures at different scales. Each DTCM employs a learnable texture
descriptor, specifically a learnable Gabor filter (LGF), to extract texture
features. By leveraging LGFs, the network extracts first and second order
textures to describe fine textures and structural features thoroughly.
Furthermore, an attention mechanism enhances relevant features in the
first-order features, thereby highlighting significant texture details. For
second-order features, a competitive mechanism emphasizes structural
information while reducing noise from higher-order features. Extensive
experimental results reveal that DOTCNet significantly outperforms several
standard algorithms on the publicly available PolyU-FKP dataset.
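For readers unfamiliar with Gabor filters, the sketch below builds a fixed real-valued Gabor kernel from the standard formula: a Gaussian envelope multiplied by a cosine carrier. In a learnable Gabor filter the parameters (sigma, theta, wavelength, gamma, psi) would be trained rather than fixed. The parameter values and function name are assumptions for illustration, and the sketch does not implement DOTCNet's second-order or competitive mechanisms.

```python
import math

def gabor_kernel(size, sigma, theta, wavelength, gamma=1.0, psi=0.0):
    """Real part of a 2D Gabor filter sampled on a size x size grid (size odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier oscillates along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + gamma * gamma * yr * yr) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

kernel = gabor_kernel(size=5, sigma=2.0, theta=0.0, wavelength=4.0)
```

Convolving an image with kernels at several orientations and scales yields texture responses; making the parameters trainable lets a network adapt the filters to the data.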
☆ PopAlign: Population-Level Alignment for Fair Text-to-Image Generation
Text-to-image (T2I) models achieve high-fidelity generation through extensive
training on large datasets. However, these models may unintentionally pick up
undesirable biases of their training data, such as over-representation of
particular identities in gender- or ethnicity-neutral prompts. Existing
alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and
Direct Preference Optimization (DPO) fail to address this problem effectively
because they operate on pairwise preferences consisting of individual samples,
while the aforementioned biases can only be measured at a population level. For
example, a single sample for the prompt "doctor" could be male or female, but a
model generating predominantly male doctors even with repeated sampling
reflects a gender bias. To address this limitation, we introduce PopAlign, a
novel approach for population-level preference optimization, while standard
optimization would prefer entire sets of samples over others. We further derive
a stochastic lower bound that directly optimizes for individual samples from
preferred populations over others for scalable training. Using human evaluation
and standard image quality and bias metrics, we show that PopAlign
significantly mitigates the bias of pretrained T2I models while largely
preserving the generation quality. Code is available at
https://github.com/jacklishufan/PopAlignSDXL.
comment: 18 pages, 10 figures
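The point that bias is only measurable at the population level can be illustrated with a small sketch that scores a set of generated samples rather than a single one. The attribute labels, the parity-gap metric, and the function names are assumptions for this example and are not the metrics used in the paper.

```python
from collections import Counter

def attribute_shares(labels):
    """Fraction of samples carrying each attribute label."""
    counts = Counter(labels)
    n = len(labels)
    return {label: count / n for label, count in counts.items()}

def parity_gap(labels, groups):
    """Difference between the most and least frequent group (0.0 = balanced)."""
    shares = attribute_shares(labels)
    values = [shares.get(g, 0.0) for g in groups]
    return max(values) - min(values)

# Hypothetical attribute labels for ten images generated from one neutral prompt.
samples = ["male"] * 9 + ["female"] * 1
gap = parity_gap(samples, ["male", "female"])
balanced_gap = parity_gap(["male", "female"] * 5, ["male", "female"])
```

A single image carries no information about this gap, so a preference signal built on individual samples cannot express it; a score computed over the whole set can.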
☆ CSAKD: Knowledge Distillation with Cross Self-Attention for Hyperspectral and Multispectral Image Fusion
Hyperspectral imaging, capturing detailed spectral information for each
pixel, is pivotal in diverse scientific and industrial applications. Yet, the
acquisition of high-resolution (HR) hyperspectral images (HSIs) is often
hindered by the hardware limitations of existing imaging systems. A
prevalent workaround involves capturing both a high-resolution multispectral
image (HR-MSI) and a low-resolution (LR) HSI, subsequently fusing them to yield
the desired HR-HSI. Although deep learning-based methods have shown promise
in HR-MSI/LR-HSI fusion and LR-HSI super-resolution (SR), their substantial
model complexities hinder deployment on resource-constrained imaging devices.
This paper introduces a novel knowledge distillation (KD) framework for
HR-MSI/LR-HSI fusion to achieve SR of LR-HSI. Our KD framework integrates the
proposed Cross-Layer Residual Aggregation (CLRA) block to enhance efficiency
for constructing the Dual Two-Streamed (DTS) network structure, designed to extract
joint and distinct features from LR-HSI and HR-MSI simultaneously. To fully
exploit the spatial and spectral feature representations of LR-HSI and HR-MSI,
we propose a novel Cross Self-Attention (CSA) fusion module to adaptively fuse
those features to improve the spatial and spectral quality of the reconstructed
HR-HSI. Finally, the proposed KD-based joint loss function is employed to
co-train the teacher and student networks. Our experimental results demonstrate
that the student model not only achieves comparable or superior LR-HSI SR
performance but also significantly reduces the model size and computational
requirements. This marks a substantial advancement over existing
state-of-the-art methods. The source code is available at
https://github.com/ming053l/CSAKD.
comment: Submitted to TIP 2024
☆ PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation
Video instance segmentation requires detecting, segmenting, and tracking
objects in videos, typically relying on costly video annotations. This paper
introduces a method that eliminates video annotations by utilizing image
datasets. The PM-VIS algorithm is adapted to handle both bounding box and
instance-level pixel annotations dynamically. We introduce ImageNet-bbox to
supplement missing categories in video datasets and propose the PM-VIS+
algorithm to adjust supervision based on annotation types. To enhance accuracy,
we use pseudo masks and semi-supervised optimization techniques on unannotated
video data. This method achieves high video instance segmentation performance
without manual video annotations, offering a cost-effective solution and new
perspectives for video instance segmentation applications. The code will be
available at https://github.com/ldknight/PM-VIS-plus
comment: MIPR 2024
☆ Basketball-SORT: An Association Method for Complex Multi-object Occlusion Problems in Basketball Multi-object Tracking
Recent deep learning-based object detection approaches have led to
significant progress in multi-object tracking (MOT) algorithms. The current MOT
methods mainly focus on pedestrian or vehicle scenes, but basketball scenes
often involve occlusions among three or more objects with similar appearances
and high-intensity complex motions, which we call complex
multi-object occlusion (CMOO). Here, we propose an online and robust MOT
approach, named Basketball-SORT, which focuses on the CMOO problems in
basketball videos. To overcome the CMOO problem, instead of using the
intersection-over-union-based (IoU-based) approach, we use the trajectories of
neighboring frames based on the projected positions of the players. Our method
designs the basketball game restriction (BGR) and reacquiring Long-Lost IDs
(RLLI) based on the characteristics of basketball scenes, and we also solve the
occlusion problem based on the player trajectories and appearance features.
Experimental results show that our method achieves a Higher Order Tracking
Accuracy (HOTA) score of 63.48% on the basketball fixed video dataset and
outperforms other recent popular approaches. Overall, our approach solves the
CMOO problem more effectively than recent MOT algorithms.
☆ AstMatch: Adversarial Self-training Consistency Framework for Semi-Supervised Medical Image Segmentation
Semi-supervised learning (SSL) has shown considerable potential in medical
image segmentation, primarily leveraging consistency regularization and
pseudo-labeling. However, many SSL approaches only pay attention to low-level
consistency and overlook the significance of pseudo-label reliability.
Therefore, in this work, we propose an adversarial self-training consistency
framework (AstMatch). First, we design an adversarial consistency
regularization (ACR) approach to enhance knowledge transfer and strengthen
prediction consistency under varying perturbation intensities. Second, we apply
a feature matching loss for adversarial training to incorporate high-level
consistency regularization. Additionally, we present the pyramid channel
attention (PCA) and efficient channel and spatial attention (ECSA) modules to
improve the discriminator's performance. Finally, we propose an adaptive
self-training (AST) approach to ensure the pseudo-labels' quality. The proposed
AstMatch has been extensively evaluated against cutting-edge SSL methods on three
publicly available datasets. The experimental results under different labeled
ratios indicate that AstMatch outperforms other existing methods, achieving new
state-of-the-art performance. Our code will be available at
https://github.com/GuanghaoZhu663/AstMatch.
☆ Efficient Event Stream Super-Resolution with Recursive Multi-Branch Fusion
Current Event Stream Super-Resolution (ESR) methods overlook the redundant
and complementary information present in positive and negative events within
the event stream, employing a direct mixing approach for super-resolution,
which may lead to detail loss and inefficiency. To address these issues, we
propose an efficient Recursive Multi-Branch Information Fusion Network (RMFNet)
that separates positive and negative events for complementary information
extraction, followed by mutual supplementation and refinement. Particularly, we
introduce Feature Fusion Modules (FFM) and Feature Exchange Modules (FEM). FFM
is designed for the fusion of contextual information within neighboring event
streams, leveraging the coupling relationship between positive and negative
events to alleviate the misleading effect of noise in the respective branches. FEM
efficiently promotes the fusion and exchange of information between positive
and negative branches, enabling superior local information enhancement and
global information complementation. Experimental results demonstrate that our
approach achieves over 17% and 31% improvement on synthetic and real datasets,
accompanied by a 2.3X acceleration. Furthermore, we evaluate our method on two
downstream event-driven applications, i.e., object recognition and video
reconstruction, achieving remarkable results that outperform existing methods.
Our code and Supplementary Material are available at
https://github.com/Lqm26/RMFNet.
☆ Precision matters: Precision-aware ensemble for weakly supervised semantic segmentation AAAI 2024
Weakly Supervised Semantic Segmentation (WSSS) employs weak supervision, such
as image-level labels, to train the segmentation model. Despite the impressive
achievement in recent WSSS methods, we identify that introducing weak labels
with high mean Intersection over Union (mIoU) does not guarantee high
segmentation performance. Existing studies have emphasized the importance of
prioritizing precision and reducing noise to improve overall performance. In
the same vein, we propose ORANDNet, an advanced ensemble approach tailored for
WSSS. ORANDNet combines Class Activation Maps (CAMs) from two different
classifiers to increase the precision of pseudo-masks (PMs). To further
mitigate small noise in the PMs, we incorporate curriculum learning. This
involves training the segmentation model initially with pairs of smaller-sized
images and corresponding PMs, gradually transitioning to the original-sized
pairs. By combining the original CAMs of ResNet-50 and ViT, we significantly
improve the segmentation performance over the single-best model and the naive
ensemble model, respectively. We further extend our ensemble method to CAMs
from AMN (ResNet-like) and MCTformer (ViT-like) models, achieving performance
benefits in advanced WSSS models. This highlights the potential of our ORANDNet
as a final add-on module for WSSS models.
comment: 5 pages, 5 figures, accepted in AAAI 2024 Edge Intelligence Workshop
☆ Model Predictive Simulation Using Structured Graphical Models and Transformers
We propose an approach to simulating trajectories of multiple interacting
agents (road users) based on transformers and probabilistic graphical models
(PGMs), and apply it to the Waymo SimAgents challenge. The transformer baseline
is based on the MTR model, which predicts multiple future trajectories
conditioned on the past trajectories and static road layout features. We then
improve upon these generated trajectories using a PGM, which contains factors
that encode prior knowledge, such as a preference for smooth trajectories, and
avoidance of collisions with static obstacles and other moving agents. We
perform (approximate) MAP inference in this PGM using the Gauss-Newton method.
Finally, we sample $K=32$ trajectories for each of the $N \sim 100$ agents for
the next $T=8 \Delta$ time steps, where $\Delta=10$ is the sampling rate per
second. Following the Model Predictive Control (MPC) paradigm, we only return
the first element of our forecasted trajectories at each step, and then we
replan, so that the simulation can constantly adapt to its changing
environment. We therefore call our approach "Model Predictive Simulation" or
MPS. We show that MPS improves upon the MTR baseline, especially in
safety-critical metrics such as collision rate. Furthermore, our approach is
compatible with any underlying forecasting model, and does not require extra
training, so we believe it is a valuable contribution to the community.
comment: Special Mention at the Waymo Sim Agents Challenge 2024
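The receding-horizon loop described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `forecast` is a hypothetical constant-velocity stand-in for the MTR forecaster, `refine` is a hypothetical stand-in for the PGM inference step (here it only clamps positions in front of a static obstacle), and the one-dimensional state is chosen for brevity. The default horizon of 80 steps follows from T = 8Δ with Δ = 10.

```python
def forecast(pos, vel, horizon):
    """Hypothetical forecaster: constant-velocity rollout over `horizon` steps."""
    return [pos + vel * (k + 1) for k in range(horizon)]


def refine(plan, wall):
    """Hypothetical stand-in for PGM refinement: never pass a static obstacle."""
    return [min(p, wall) for p in plan]


def simulate(pos, vel, wall, n_steps, horizon=80):
    """Plan over the full horizon, execute only the first step, then replan."""
    executed = []
    for _ in range(n_steps):
        plan = refine(forecast(pos, vel, horizon), wall)
        pos = plan[0]         # MPC-style: keep only the first planned position
        executed.append(pos)  # the rest of the plan is discarded and recomputed
    return executed


# Toy run: an agent moving at unit speed toward an obstacle at position 2.5.
trajectory = simulate(pos=0.0, vel=1.0, wall=2.5, n_steps=4)
```

Because the plan is recomputed from the current state at every step, a change in the environment (for example, a different `wall` value) would be picked up on the next iteration without any retraining of the forecaster.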
☆ PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation IJCAI 2024
The ascension of Unmanned Aerial Vehicles (UAVs) in various fields
necessitates effective UAV image segmentation, which faces challenges due to
the dynamic perspectives of UAV-captured images. Traditional segmentation
algorithms falter as they cannot accurately mimic the complexity of UAV
perspectives, and the cost of obtaining multi-perspective labeled datasets is
prohibitive. To address these issues, we introduce the PPTFormer, a novel
Pseudo Multi-Perspective Transformer
network that revolutionizes UAV image segmentation. Our approach circumvents
the need for actual multi-perspective data by creating pseudo perspectives for
enhanced multi-perspective learning. The PPTFormer network boasts Perspective
Decomposition, novel Perspective Prototypes, and a specialized encoder and
decoder that together achieve superior segmentation results through Pseudo
Multi-Perspective Attention (PMP Attention) and fusion. Our experiments
demonstrate that PPTFormer achieves state-of-the-art performance across five
UAV segmentation datasets, confirming its capability to effectively simulate
UAV flight perspectives and significantly advance segmentation precision. This
work presents a pioneering leap in UAV scene understanding and sets a new
benchmark for future developments in semantic segmentation.
comment: IJCAI 2024
☆ Optimal Video Compression using Pixel Shift Tracking
Video comprises approximately 85% of all internet traffic, yet video
encoding/compression has historically been done with hard-coded rules, which
have worked well but only up to a point. The last few years have seen a surge
in video compression algorithms using ML-based models, and many of them have
outperformed several legacy codecs. These models either encode video end to
end with an ML approach or replace some intermediate steps in legacy codecs
with ML models to increase the efficiency of those steps. Optimizing video
storage is an essential aspect of video processing, and one possible way to
achieve it is to avoid redundant data at each frame. In this paper, we
introduce the removal of redundancies in subsequent frames of a given video as
the main approach to video compression. We call this method Redundancy Removal
using Shift (R²S). This method can be used with various machine learning
models and makes compression more accessible and adaptable. In this study, we
use a computer vision-based pixel point tracking method to identify redundant
pixels and encode video for optimal storage.
☆ A Survey on Deep Clustering: From the Prior Perspective
Facilitated by the powerful feature extraction ability of neural networks,
deep clustering has achieved great success in analyzing high-dimensional and
complex real-world data. The performance of deep clustering methods is affected
by various factors such as network structures and learning objectives. However,
as pointed out in this survey, the essence of deep clustering lies in the
incorporation and utilization of prior knowledge, which is largely ignored by
existing works. From pioneering deep clustering methods based on data structure
assumptions to recent contrastive clustering methods based on data augmentation
invariances, the development of deep clustering intrinsically corresponds to
the evolution of prior knowledge. In this survey, we provide a comprehensive
review of deep clustering methods by categorizing them into six types of prior
knowledge. We find that in general the prior innovation follows two trends,
namely, i) from mining to constructing, and ii) from internal to external.
In addition, we provide a benchmark on five widely-used datasets and analyze the
performance of methods with diverse priors. By offering a novel prior-knowledge
perspective, we hope this survey can provide fresh insights
and inspire future research in the deep clustering community.
☆ SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Synthetic data generation has gained significant attention recently for its
utility in training large vision and language models. However, the application
of synthetic data to the training of multimodal context-augmented generation
systems has been relatively unexplored. This gap in existing work is important
because existing vision and language models (VLMs) are not trained specifically
for context-augmented generation. Resources for adapting such models are
therefore crucial for enabling their use in retrieval-augmented generation
(RAG) settings, where a retriever is used to gather relevant information that
is then provided to a generative model via context augmentation.
To address this challenging problem, we generate SK-VQA: a large synthetic
multimodal dataset containing over 2 million question-answer pairs which
require external knowledge to determine the final answer. Our dataset is both
larger and significantly more diverse than existing resources of its kind,
possessing over 11x more unique questions and containing images from a greater
variety of sources than previously-proposed datasets. Through extensive
experiments, we demonstrate that our synthetic dataset can not only serve as a
challenging benchmark, but is also highly effective for adapting existing
generative multimodal models for context-augmented generation.
♻ ☆ Exploiting Diffusion Prior for Real-World Image Super-Resolution
We present a novel approach to leverage prior knowledge encapsulated in
pre-trained text-to-image diffusion models for blind super-resolution (SR).
Specifically, by employing our time-aware encoder, we can achieve promising
restoration results without altering the pre-trained synthesis model, thereby
preserving the generative prior and minimizing training cost. To remedy the
loss of fidelity caused by the inherent stochasticity of diffusion models, we
employ a controllable feature wrapping module that allows users to balance
quality and fidelity by simply adjusting a scalar value during the inference
process. Moreover, we develop a progressive aggregation sampling strategy to
overcome the fixed-size constraints of pre-trained diffusion models, enabling
adaptation to resolutions of any size. A comprehensive evaluation of our method
using both synthetic and real-world benchmarks demonstrates its superiority
over current state-of-the-art approaches. Code and models are available at
https://github.com/IceClear/StableSR.
comment: Accepted by IJCV'2024. Some Figs are compressed due to size limits.
Uncompressed ver.:
https://github.com/IceClear/StableSR/releases/download/UncompressedPDF/StableSR_IJCV_Uncompressed.pdf.
Project page: https://iceclear.github.io/projects/stablesr/
♻ ☆ EnSolver: Uncertainty-Aware Ensemble CAPTCHA Solvers with Theoretical Guarantees UAI 2023
The popularity of text-based CAPTCHA as a security mechanism to protect
websites from automated bots has prompted research on CAPTCHA solvers, with
the aim of understanding its failure cases and subsequently making CAPTCHAs
more secure. Recently proposed solvers, built on advances in deep learning, are
able to crack even the very challenging CAPTCHAs with high accuracy. However,
these solvers often perform poorly on out-of-distribution samples that contain
visual features different from those in the training set. Furthermore, they
lack the ability to detect and avoid such samples, making them susceptible to
being locked out by defense systems after a certain number of failed attempts.
In this paper, we propose EnSolver, a family of CAPTCHA solvers that use deep
ensemble uncertainty to detect and skip out-of-distribution CAPTCHAs, making the
solvers harder to detect. We prove novel theoretical bounds on the effectiveness
of our solvers and demonstrate their use with state-of-the-art CAPTCHA solvers.
Our experiments show that the proposed approaches perform well when cracking
CAPTCHA datasets that contain both in-distribution and out-of-distribution
samples.
comment: A previous version of this paper was presented at the Epistemic
Uncertainty - E-pi UAI 2023 Workshop
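The skip-when-uncertain idea in this abstract can be illustrated with a small sketch. The abstract does not specify the uncertainty measure, so the vote-agreement ratio used here is an assumption, and `solve_or_skip` and `min_agreement` are hypothetical names introduced only for illustration.

```python
from collections import Counter


def solve_or_skip(predictions, min_agreement=0.8):
    """Return the majority answer of an ensemble, or None to skip the CAPTCHA.

    `predictions` holds one decoded string per ensemble member. Agreement
    below `min_agreement` is treated (by assumption) as a sign that the
    sample is out-of-distribution.
    """
    answer, votes = Counter(predictions).most_common(1)[0]
    agreement = votes / len(predictions)
    return answer if agreement >= min_agreement else None


# All five members agree: the answer is submitted.
confident = solve_or_skip(["ab12", "ab12", "ab12", "ab12", "ab12"])
# Members disagree (best agreement is 2/5): the sample is skipped.
uncertain = solve_or_skip(["ab12", "ab1z", "xb12", "ab12", "qq"])
```

Returning `None` instead of a low-confidence guess is what avoids accumulating failed attempts on samples the ensemble has not seen during training.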
♻ ☆ Robustness Assessment of a Runway Object Classifier for Safe Aircraft Taxiing SC
Yizhak Elboher, Raya Elsaleh, Omri Isac, Mélanie Ducoffe, Audrey Galametz, Guillaume Povéda, Ryma Boumazouza, Noémie Cohen, Guy Katz
As deep neural networks (DNNs) are becoming the prominent solution for many
computational problems, the aviation industry seeks to explore their potential
in alleviating pilot workload and in improving operational safety. However, the
use of DNNs in this type of safety-critical application requires a thorough
certification process. This need can be addressed through formal verification,
which provides rigorous assurances -- e.g., by proving the absence of certain
mispredictions. In this case-study paper, we demonstrate this process using an
image-classifier DNN currently under development at Airbus and intended for use
during the aircraft taxiing phase. We use formal methods to assess this DNN's
robustness to three common image perturbation types: noise, brightness and
contrast, and some of their combinations. This process entails multiple
invocations of the underlying verifier, which might be computationally
expensive; and we therefore propose a method that leverages the monotonicity of
these robustness properties, as well as the results of past verification
queries, in order to reduce the overall number of verification queries required
by nearly 60%. Our results provide an indication of the level of robustness
achieved by the DNN classifier under study, and indicate that it is
considerably more vulnerable to noise than to brightness or contrast
perturbations.
comment: This is a preprint version of the paper in the proceedings of 43rd
Digital Avionics Systems Conference (DASC)
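One way monotonicity can cut the number of verifier calls is sketched below. It assumes that robustness at a perturbation level implies robustness at every smaller level, so a binary search over a sorted grid of levels replaces a linear scan, and a cache reuses past results. The abstract does not describe the authors' exact scheme; `largest_robust_level` and `toy_verify` are hypothetical names, and the toy verifier is a placeholder for a real formal verification query.

```python
def largest_robust_level(levels, verify, cache=None):
    """Find the largest level in sorted `levels` for which `verify` holds.

    Assumes monotonicity: if verify(e) is True, it is True for all smaller e.
    Returns (best_level_or_None, number_of_new_verifier_calls).
    """
    cache = {} if cache is None else cache
    lo, hi, best, calls = 0, len(levels) - 1, None, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        eps = levels[mid]
        if eps not in cache:          # reuse results of past queries
            cache[eps] = verify(eps)  # the only place the verifier is invoked
            calls += 1
        if cache[eps]:
            best, lo = eps, mid + 1   # robust here: try larger perturbations
        else:
            hi = mid - 1              # not robust: larger levels also fail
    return best, calls


def toy_verify(eps):
    """Placeholder verifier: the toy classifier is robust up to level 5."""
    return eps <= 5


levels = list(range(1, 17))  # 16 candidate perturbation levels
best, calls = largest_robust_level(levels, toy_verify)
```

On this toy grid the search needs 4 verifier calls instead of 16; the saving in a real setting depends on the grid and on how many earlier results can be reused.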
♻ ☆ Learning to utilize image second-order derivative information for crisp edge detection
Edge detection is a fundamental task in computer vision. It has made great
progress under the development of deep convolutional neural networks (DCNNs),
some of which have achieved beyond-human-level performance. However, recent
top-performing edge detection methods tend to generate thick and noisy edge
lines. In this work, we solve this problem from two aspects: (1) the lack of
prior knowledge regarding image edges, and (2) the issue of imbalanced pixel
distribution. We propose a second-order derivative-based multi-scale contextual
enhancement module (SDMCM) to help the model locate true edge pixels accurately
by introducing the edge prior knowledge. We also construct a hybrid focal loss
function (HFL) to alleviate the imbalanced distribution issue. In addition, we
employ the conditionally parameterized convolution (CondConv) to develop a
novel boundary refinement module (BRM), which can further refine the final
output edge maps. In the end, we propose a U-shape network named LUS-Net which
is based on the SDMCM and BRM for crisp edge detection. We perform extensive
experiments on three standard benchmarks, and the experimental results show
that our method predicts crisp and clean edge maps and achieves
state-of-the-art performance on the BSDS500 dataset (ODS=0.829), NYUD-V2
dataset (ODS=0.768), and BIPED dataset (ODS=0.903).
♻ ☆ DWARF: Disease-weighted network for attention map refinement
The interpretability of deep learning is crucial for evaluating the
reliability of medical imaging models and reducing the risks of inaccurate
patient recommendations. This study addresses the "human out of the loop" and
"trustworthiness" issues in medical image analysis by integrating medical
professionals into the interpretability process. We propose a disease-weighted
attention map refinement network (DWARF) that leverages expert feedback to
enhance model relevance and accuracy. Our method employs cyclic training to
iteratively improve diagnostic performance, generating precise and
interpretable feature maps. Experimental results demonstrate significant
improvements in interpretability and diagnostic accuracy across multiple
medical imaging datasets. This approach fosters effective collaboration between
AI systems and healthcare professionals, ultimately aiming to improve patient
outcomes.
♻ ☆ Modeling State Shifting via Local-Global Distillation for Event-Frame Gaze Tracking
This paper tackles the problem of passive gaze estimation using both event
and frame data. Considering the inherently different physiological structures,
it is intractable to accurately estimate gaze purely based on a given state.
Thus, we reformulate gaze estimation as the quantification of the state
shifting from the current state to several prior registered anchor states.
Specifically, we propose a two-stage learning-based gaze estimation framework
that divides the whole gaze estimation process into a coarse-to-fine approach
involving anchor state selection and final gaze location. Moreover, to improve
the generalization ability, instead of learning a large gaze estimation network
directly, we align a group of local experts with a student network, where a
novel denoising distillation algorithm is introduced to utilize denoising
diffusion techniques to iteratively remove inherent noise in event data.
Extensive experiments demonstrate the effectiveness of the proposed method,
which surpasses state-of-the-art methods by a large margin of 15%. The code
will be publicly available at
https://github.com/jdjdli/Denoise_distill_EF_gazetracker.
♻ ☆ Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version)
Reinforcement learning (RL) for robot control typically requires a detailed
representation of the environment state, including information about
task-relevant objects not directly measurable. Keypoint detectors, such as
spatial autoencoders (SAEs), are a common approach to extracting a
low-dimensional representation from high-dimensional image data. SAEs aim at
spatial features such as object positions, which are often useful
representations in robotic RL. However, whether an SAE is actually able to
track objects in the scene and thus yields a spatial state representation well
suited for RL tasks has rarely been examined due to a lack of established
metrics. In this paper, we propose to assess the performance of an SAE instance
by measuring how well keypoints track ground truth objects in images. We
present a computationally lightweight metric and use it to evaluate common
baseline SAE architectures on image data from a simulated robot task. We find
that common SAEs differ substantially in their spatial extraction capability.
Furthermore, we validate that SAEs that perform well in our metric achieve
superior performance when used in downstream RL. Thus, our metric is an
effective and lightweight indicator of RL performance before executing
expensive RL training. Building on these insights, we identify three key
modifications of SAE architectures to improve tracking performance. We make our
code available at anonymous.4open.science/r/sae-rl.
comment: 19 pages, 12 figures
♻ ☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various
generation tasks by leveraging latent variables to learn data distributions and
generate high-quality samples. Despite the field of explainable AI making
strides in interpreting machine learning models, understanding latent variables
in generative models remains challenging. This paper introduces
LatentExplainer, a framework for automatically generating semantically
meaningful explanations of latent variables in deep generative models.
LatentExplainer tackles three main challenges: inferring the meaning of latent
variables, aligning explanations with inductive biases, and handling varying
degrees of explainability. By perturbing latent variables and interpreting
changes in generated data, the framework provides a systematic approach to
understanding and controlling the data generation process, enhancing the
transparency and interpretability of deep generative models. We evaluate our
proposed method on several real-world and synthetic datasets, and the results
demonstrate superior performance in generating high-quality explanations of
latent variables.
♻ ☆ Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning
Contrastive Language-Image Pre-training (CLIP) demonstrates impressive
zero-shot capability. The key to improving the adaptation of CLIP to downstream
tasks with few exemplars lies in how to effectively model and transfer the
useful knowledge embedded in CLIP. Previous work typically mines this knowledge
based on the limited visual samples and closed-set semantics (i.e., within the
target category set of the downstream task). However, the aligned CLIP image/text
encoders contain abundant relationships between visual features and almost
infinite open semantics, which may benefit few-shot learning but remain
unexplored. In this paper, we propose to mine open semantics as anchors to
perform a relation transition from image-anchor relationship to image-target
relationship to make predictions. Specifically, we adopt a transformer module
which takes the visual feature as "Query", the text features of the anchors as
"Key" and the similarity matrix between the text features of anchor and target
classes as "Value". In this way, the output of such a transformer module
represents the relationship between the image and target categories, i.e., the
classification predictions. To avoid manually selecting the open semantics, we
make the [CLASS] token of input text embedding learnable. We conduct extensive
experiments on eleven representative classification datasets. The results show
that our method performs favorably against previous state-of-the-art methods
in few-shot classification settings.
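The Query/Key/Value arrangement described in this abstract can be sketched as a single attention step. This is a simplified illustration under stated assumptions: it omits the learnable projections and learnable anchor tokens of the actual module, uses plain dot products as similarities, and the `temperature` parameter and function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relation_transition(image_feat, anchor_feats, target_feats, temperature=0.1):
    """Turn image-anchor relations into image-target scores.

    Query = image feature, Key = anchor text features,
    Value = similarity matrix between anchor and target text features.
    """
    # Image-anchor relationship: attention weights over the anchors.
    weights = softmax([dot(image_feat, a) / temperature for a in anchor_feats])
    # Value matrix: one row per anchor, one column per target class.
    value = [[dot(a, t) for t in target_feats] for a in anchor_feats]
    # Image-target relationship: attention-weighted sum of the Value rows.
    n_targets = len(target_feats)
    return [sum(w * row[j] for w, row in zip(weights, value)) for j in range(n_targets)]


# Toy 2-D features: the image aligns with the first anchor, which in turn
# aligns with the first target class.
scores = relation_transition(
    image_feat=[1.0, 0.0],
    anchor_feats=[[1.0, 0.0], [0.0, 1.0]],
    target_feats=[[1.0, 0.0], [0.0, 1.0]],
)
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

The output has one score per target class, so it can be read directly as a classification prediction, which matches the role the abstract assigns to the transformer module's output.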
♻ ☆ Kandinsky 3.0 Technical Report
Vladimir Arkhipkin, Andrei Filatov, Viacheslav Vasilev, Anastasia Maltseva, Said Azizov, Igor Pavlov, Julia Agafonova, Andrey Kuznetsov, Denis Dimitrov
We present Kandinsky 3.0, a large-scale text-to-image generation model based
on latent diffusion, continuing the series of text-to-image Kandinsky models
and reflecting our progress toward higher quality and realism in image
generation. In this report we describe the architecture of the model, the data
collection procedure, the training technique, and the production system for
user interaction. We focus on the key components that, as identified through a
large number of experiments, had the most significant impact on improving the
quality of our model compared to others. We also describe extensions and
applications of our model, including super resolution, inpainting, image
editing, image-to-video generation, and a distilled version of Kandinsky 3.0,
Kandinsky 3.1, which performs inference in 4 steps of the reverse process and
is 20 times faster without a decrease in visual quality. In side-by-side human
preference comparisons, Kandinsky shows better text understanding and works
better on specific domains. The code is available at
https://github.com/ai-forever/Kandinsky-3
comment: Project page: https://ai-forever.github.io/Kandinsky-3
♻ ☆ Deformable MRI Sequence Registration for AI-based Prostate Cancer Diagnosis
Alessa Hering, Sarah de Boer, Anindo Saha, Jasper J. Twilt, Mattias P. Heinrich, Derya Yakar, Maarten de Rooij, Henkjan Huisman, Joeran S. Bosma
The PI-CAI (Prostate Imaging: Cancer AI) challenge led to expert-level
diagnostic algorithms for clinically significant prostate cancer detection. The
algorithms receive biparametric MRI scans as input, which consist of
T2-weighted and diffusion-weighted scans. These scans can be misaligned due to
multiple factors in the scanning process. Image registration can alleviate this
issue by predicting the deformation between the sequences. We investigate the
effect of image registration on the diagnostic performance of AI-based prostate
cancer diagnosis. First, the image registration algorithm, developed in
MeVisLab, is analyzed using a dataset with paired lesion annotations. Second,
the effect on diagnosis is evaluated by comparing case-level cancer diagnosis
performance between using the original dataset, rigidly aligned
diffusion-weighted scans, or deformably aligned diffusion-weighted scans. Rigid
registration showed no improvement. Deformable registration demonstrated a
substantial improvement in lesion overlap (+10% median Dice score) and a
positive yet non-significant improvement in diagnostic performance (+0.3%
AUROC, p=0.18). Our investigation shows that a substantial improvement in
lesion alignment does not directly lead to a significant improvement in
diagnostic performance. Qualitative analysis indicated that jointly developing
image registration methods and diagnostic AI algorithms could enhance
diagnostic accuracy and patient outcomes.
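For reference, the Dice score used above to measure lesion overlap is 2|A ∩ B| / (|A| + |B|) for two masks A and B. The sketch below computes it on toy masks represented as sets of pixel coordinates; the set representation, the toy lesions, and the convention for two empty masks are assumptions made for illustration and are not taken from the paper.

```python
def dice_score(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two collections of pixel coordinates."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty masks overlap perfectly
    return 2 * len(a & b) / (len(a) + len(b))


# Toy 2x2 lesion on the T2-weighted scan.
lesion_t2 = {(0, 0), (0, 1), (1, 0), (1, 1)}
# The same lesion on a diffusion-weighted scan shifted by one pixel.
lesion_dwi_misaligned = {(r, c + 1) for r, c in lesion_t2}
# After an ideal registration the two annotations coincide.
lesion_dwi_aligned = set(lesion_t2)

dice_before = dice_score(lesion_t2, lesion_dwi_misaligned)
dice_after = dice_score(lesion_t2, lesion_dwi_aligned)
```

A one-pixel shift of this small lesion already halves the overlap, which gives a sense of why alignment is measured with this score.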
♻ ☆ Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
Self-supervised features are the cornerstone of modern machine learning
systems. They are typically pre-trained on data collections whose construction
and curation require extensive human effort. This manual process has
some limitations similar to those encountered in supervised learning, e.g., the
crowd-sourced selection of data is costly and time-consuming, which prevents
scaling the dataset size. In this work, we consider the problem of automatic
curation of high-quality datasets for self-supervised pre-training. We posit
that such datasets should be large, diverse and balanced, and propose a
clustering-based approach for building ones satisfying all these criteria. Our
method involves successive and hierarchical applications of $k$-means on a
large and diverse data repository to obtain clusters that distribute uniformly
among data concepts, followed by a hierarchical, balanced sampling step from
these clusters. Extensive experiments on three different data domains including
web-based images, satellite images and text show that features trained on our
automatically curated datasets outperform those trained on uncurated data while
being on par or better than ones trained on manually curated data. Code is
available at https://github.com/facebookresearch/ssl-data-curation.
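The cluster-then-sample idea can be illustrated on toy data. The sketch below is a single-level simplification, not the hierarchical procedure of the paper: it runs plain k-means on one-dimensional points and then draws the same number of samples from every cluster. The function names, the deterministic quantile initialization, and the toy data are all assumptions made for illustration.

```python
import random


def kmeans_1d(points, k, iters=20):
    """Plain k-means on scalars with a deterministic quantile initialization."""
    pts = sorted(points)
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


def balanced_sample(clusters, per_cluster, rng):
    """Draw up to `per_cluster` items from every cluster, regardless of its size."""
    sample = []
    for cluster in clusters:
        sample.extend(rng.sample(cluster, min(per_cluster, len(cluster))))
    return sample


# Imbalanced toy pool: 90 points near 0 (a common concept), 10 near 10 (a rare one).
pool = [i * 0.01 for i in range(90)] + [10 + i * 0.01 for i in range(10)]
clusters = kmeans_1d(pool, k=2)
curated = balanced_sample(clusters, per_cluster=5, rng=random.Random(0))
```

In the raw pool the rare concept makes up 10% of the data; after balanced sampling it makes up half of the curated subset, which is the rebalancing effect the abstract aims for.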
♻ ☆ LiverUSRecon: Automatic 3D Reconstruction and Volumetry of the Liver with a Few Partial Ultrasound Scans MICCAI 2024
Kaushalya Sivayogaraj, Sahan T. Guruge, Udari Liyanage, Jeevani Udupihille, Saroj Jayasinghe, Gerard Fernando, Ranga Rodrigo, M. Rukshani Liyanaarachchi
3D reconstruction of the liver for volumetry is important for qualitative
analysis and disease diagnosis. Liver volumetry using ultrasound (US) scans,
although advantageous due to less acquisition time and safety, is challenging
due to the inherent noisiness in US scans, blurry boundaries, and partial liver
visibility. We address these challenges by using the segmentation masks of a
few incomplete sagittal-plane US scans of the liver in conjunction with a
statistical shape model (SSM) built using a set of CT scans of the liver. We
compute the shape parameters needed to warp this canonical SSM to fit the US
scans through a parametric regression network. The resulting 3D liver
reconstruction is accurate and leads to automatic liver volume calculation. We
evaluate the accuracy of the estimated liver volumes with respect to CT
segmentation volumes using RMSE. Our volume computation is statistically much
closer to the volume estimated using CT scans than the volume computed using
Childs' method by radiologists: p-value of 0.094 (>0.05) says that there is no
significant difference between CT segmentation volumes and ours in contrast to
Childs' method. We validate our method using investigations (ablation studies)
on the US image resolution, the number of CT scans used for SSM, the number of
principal components, and the number of input US scans. To the best of our
knowledge, this is the first automatic liver volumetry system using a few
incomplete US scans given a set of CT scans of livers for SSM.
comment: 10 pages, Accepted to MICCAI 2024
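
The RMSE comparison against CT segmentation volumes mentioned above reduces to a short computation. The sketch below shows only the metric; the function name `rmse` and the idea of passing volumes as plain lists are assumptions for illustration, and no values from the paper are reproduced.

```python
import math


def rmse(estimated, reference):
    """Root-mean-square error between paired volume estimates."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))
```

Here `estimated` would hold the volumes produced by the reconstruction and `reference` the corresponding CT segmentation volumes, in the same unit and the same patient order.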
♻ ☆ Cross-domain Denoising for Low-dose Multi-frame Spiral Computed Tomography
Computed tomography (CT) has been used worldwide as a non-invasive test to
assist in diagnosis. However, the ionizing nature of X-ray exposure raises
concerns about potential health risks such as cancer. The desire for lower
radiation doses has driven researchers to improve reconstruction quality.
Although previous studies on low-dose computed tomography (LDCT) denoising have
demonstrated the effectiveness of learning-based methods, most were developed
on simulated data. However, the real-world scenario differs significantly
from the simulation domain, especially when using the multi-slice spiral
scanner geometry. This paper proposes a two-stage method for the commercially
available multi-slice spiral CT scanners that better exploits the complete
reconstruction pipeline for LDCT denoising across different domains. Our
approach makes good use of the high redundancy of multi-slice projections and
the volumetric reconstructions while mitigating the over-smoothing problem in
conventional cascaded frameworks caused by aggressive denoising. The dedicated
design also provides a more explicit interpretation of the data flow. Extensive
experiments on various datasets showed that the proposed method could remove up
to 70% of noise without compromising spatial resolution, and subjective
evaluations by two experienced radiologists further supported its superior
performance against state-of-the-art methods in clinical practice.
♻ ☆ Defect Detection in Synthetic Fibre Ropes using Detectron2 Framework
Fibre ropes with the latest technology have emerged as an appealing
alternative to steel ropes for offshore industries due to their lightweight and
high tensile strength. At the same time, frequent inspection of these ropes is
essential to ensure the proper functioning and safety of the entire system. The
development of deep learning (DL) models in condition monitoring (CM)
applications offers a simpler and more effective approach for defect detection
in synthetic fibre ropes (SFRs). The present paper investigates the performance
of Detectron2, a state-of-the-art library for defect detection and instance
segmentation. Detectron2 with Mask R-CNN architecture is used for segmenting
defects in SFRs. Mask R-CNN with various backbone configurations has been
trained and tested on an experimentally obtained dataset comprising 1,803
high-dimensional images containing seven damage classes (placking high,
placking medium, placking low, compression, core out, chafing, and normal
respectively) for SFRs. By leveraging the capabilities of Detectron2, this
study aims to develop an automated and efficient method for detecting defects
in SFRs, enhancing the inspection process, and ensuring the safety of the fibre
ropes.
comment: 12 pages, 8 figures, 4 tables
♻ ☆ Assessment of Sentinel-2 spatial and temporal coverage based on the scene classification layer
Since the launch of the Sentinel-2 (S2) satellites, many ML models have used
the data for diverse applications. The scene classification layer (SCL) inside
the S2 product provides rich information for training, such as filtering images
with high cloud coverage. However, there is more potential in this. We propose
a technique to assess the clean optical coverage of a region, expressed by a
SITS and calculated with the S2-based SCL data. With a manual threshold and
specific labels in the SCL, the proposed technique assigns a percentage of
spatial and temporal coverage across the time series and a high/low assessment.
By evaluating the AI4EO challenge for Enhanced Agriculture, we show that the
assessment is correlated to the predictive results of ML models. The
classification results in a region with low spatial and temporal coverage are
worse than in a region with high coverage. Finally, we applied the technique
across all continents of the global dataset LandCoverNet.
comment: Accepted at IEEE International Geoscience and Remote Sensing
Symposium 2024
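
A coverage assessment of this kind can be sketched as follows. The choice of "clean" SCL classes (vegetation 4, not vegetated 5, water 6), the 0.5 threshold, and the exact definitions of spatial and temporal coverage are assumptions made for illustration; the abstract does not specify them, and the function names are hypothetical.

```python
# Sentinel-2 SCL class codes treated as clean here (an assumption):
# 4 = vegetation, 5 = not vegetated, 6 = water.
CLEAN_LABELS = frozenset({4, 5, 6})


def spatial_coverage(scl, clean=CLEAN_LABELS):
    """Fraction of pixels in one SCL raster (a list of rows) with a clean label."""
    total = sum(len(row) for row in scl)
    good = sum(1 for row in scl for value in row if value in clean)
    return good / total if total else 0.0


def assess(series, threshold=0.5, clean=CLEAN_LABELS):
    """Summarise a time series of SCL rasters for one region.

    Returns (mean spatial coverage, share of dates at or above the
    threshold, "high" or "low").
    """
    if not series:
        raise ValueError("series must contain at least one raster")
    per_date = [spatial_coverage(scl, clean) for scl in series]
    spatial = sum(per_date) / len(per_date)
    temporal = sum(1 for c in per_date if c >= threshold) / len(per_date)
    label = "high" if spatial >= threshold and temporal >= threshold else "low"
    return spatial, temporal, label
```

Keeping the clean label set and the threshold as parameters mirrors the manual threshold and specific labels mentioned in the abstract, so both can be adjusted per application.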
♻ ☆ Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models ACL 2024
Object hallucination has been an Achilles' heel which hinders the broader
applications of large vision-language models (LVLMs). Object hallucination
refers to the phenomenon that the LVLMs claim non-existent objects in the
image. To mitigate the object hallucinations, instruction tuning and external
model-based detection methods have been proposed, which either require
large-scale computational resources or depend on the detection result of
external models. However, there remains an under-explored field to utilize the
LVLM itself to alleviate object hallucinations. In this work, we adopt the
intuition that the LVLM tends to respond logically consistently for existent
objects but inconsistently for hallucinated objects. Therefore, we propose a
Logical Closed Loop-based framework for Object Hallucination Detection and
Mitigation, namely LogicCheckGPT. Specifically, we devise logical consistency
probing to raise questions with logical correlations, inquiring about
attributes from objects and vice versa. Whether their responses can form a
logical closed loop serves as an indicator of object hallucination. As a
plug-and-play method, it can be seamlessly applied to all existing LVLMs.
Comprehensive experiments conducted on three benchmarks across four LVLMs have
demonstrated significant improvements brought by our method, indicating its
effectiveness and generality.
comment: Accepted to ACL 2024; 19 Pages, 15 Figures, 6 Tables
♻ ☆ Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information
Volumetric video, also known as hologram video, is a novel medium that
portrays natural content in Virtual Reality (VR), Augmented Reality (AR), and
Mixed Reality (MR). It is expected to be the next-gen video technology and a
prevalent use case for 5G and beyond wireless communication. Considering that
each user typically only watches a section of the volumetric video, known as
the viewport, it is essential to have precise viewport prediction for optimal
performance. However, research on this topic is still in its infancy. To this
end, this paper proposes a novel approach, named Saliency and
Trajectory Viewport Prediction (STVP), which aims to improve the precision of
viewport prediction in volumetric video streaming. The STVP extensively
utilizes video saliency information and viewport trajectory. To our knowledge,
this is the first comprehensive study of viewport prediction in volumetric
video streaming. In particular, we introduce a novel sampling method, Uniform
Random Sampling (URS), to reduce computational complexity while still
preserving video features in an efficient manner. Then we present a saliency
detection technique that incorporates both spatial and temporal information for
detecting static, dynamic geometric, and color salient regions. Finally, we
intelligently fuse saliency and trajectory information to achieve more accurate
viewport prediction. We conduct extensive simulations to evaluate the
effectiveness of our proposed viewport prediction methods using
state-of-the-art volumetric video sequences. The experimental results show the
superiority of the proposed method over existing schemes. The dataset and
source code will be publicly accessible after acceptance.
♻ ☆ FAGhead: Fully Animate Gaussian Head from Monocular Videos
High-fidelity reconstruction of 3D human avatars has wide applications in
virtual reality. In this paper, we introduce FAGhead, a method that enables
fully controllable human portraits from monocular videos. We make the
traditional 3D morphable meshes (3DMM) explicit and optimize the neutral 3D
Gaussians to reconstruct with complex expressions. Furthermore, we employ a
novel Point-based Learnable Representation Field (PLRF) with learnable
Gaussian point positions to enhance reconstruction performance. Meanwhile, to
effectively manage the edges of avatars, we introduce alpha rendering to
supervise the alpha value of each pixel. Extensive experimental results on
open-source datasets and our captured datasets demonstrate that our approach
can generate high-fidelity 3D head avatars and fully control the expression
and pose of the virtual avatars, outperforming existing works.
♻ ☆ A Refer-and-Ground Multimodal Large Language Model for Biomedicine MICCAI2024
With the rapid development of multimodal large language models (MLLMs),
especially their capabilities in visual chat through refer and ground
functionalities, their significance is increasingly recognized. However, the
biomedical field currently exhibits a substantial gap in this area, primarily
due to the absence of a dedicated refer and ground dataset for biomedical
images. To address this challenge, we devised the Med-GRIT-270k dataset. It
comprises 270k question-and-answer pairs and spans eight distinct medical
imaging modalities. Most importantly, it is the first dataset dedicated to the
biomedical domain that integrates refer and ground conversations. The key idea
is to sample large-scale biomedical image-mask pairs from medical segmentation
datasets and generate instruction datasets from text using ChatGPT.
Additionally, we introduce a Refer-and-Ground Multimodal Large Language Model
for Biomedicine (BiRD) by using this dataset and multi-task instruction
learning. Extensive experiments have corroborated the efficacy of the
Med-GRIT-270k dataset and the multi-modal, fine-grained interactive
capabilities of the BiRD model. This holds significant reference value for the
exploration and development of intelligent biomedical assistants.
comment: Accepted by MICCAI2024
♻ ☆ ProbRadarM3F: mmWave Radar based Human Skeletal Pose Estimation with Probability Map Guided Multi-Format Feature Fusion
Millimeter wave (mmWave) radar is a non-intrusive, privacy-preserving,
relatively convenient, and inexpensive device, which has been demonstrated to
be applicable in place of RGB cameras in human indoor pose estimation tasks.
However, mmWave radar relies on the collection of reflected signals from the
target, and the information contained in the radar signals is difficult to
fully exploit. This has
been a long-standing hindrance to the improvement of pose estimation accuracy.
To address this major challenge, this paper introduces a probability map guided
multi-format feature fusion model, ProbRadarM3F. This is a novel radar feature
extraction framework using a traditional FFT method in parallel with a
probability map based positional encoding method. ProbRadarM3F fuses the
traditional heatmap features and the positional features, then effectively
achieves the estimation of 14 keypoints of the human body. Experimental
evaluation on the HuPR dataset proves the effectiveness of the model proposed
in this paper, outperforming other methods experimented on this dataset with an
AP of 69.9%. Our study emphasizes positional information in radar signals
that has not been exploited before. This provides direction to investigate
other potential non-redundant information from mmWave radar.
♻ ☆ SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues MICCAI 2024
Weakly-supervised medical image segmentation is a challenging task that aims
to reduce the annotation cost while maintaining segmentation performance. In this
paper, we present a novel framework, SimTxtSeg, that leverages simple text cues
to generate high-quality pseudo-labels and simultaneously studies cross-modal
fusion in training segmentation models. Our contribution consists of two
key components: an effective Textual-to-Visual Cue Converter that produces
visual prompts from text prompts on medical images, and a text-guided
segmentation model with Text-Vision Hybrid Attention that fuses text and image
features. We evaluate our framework on two medical image segmentation tasks:
colonic polyp segmentation and MRI brain tumor segmentation, and achieve
consistent state-of-the-art performance.
comment: accepted by MICCAI 2024
♻ ☆ Leveraging Knowledge Distillation for Lightweight Skin Cancer Classification: Balancing Accuracy and Computational Efficiency
Skin cancer is a major concern to public health, accounting for one-third of
the reported cancers. If not detected early, the cancer has the potential for
severe consequences. Recognizing the critical need for effective skin cancer
classification, we address the limitations of existing models, which are often
too large to deploy in areas with limited computational resources. In response,
we present a knowledge distillation based approach for creating a lightweight
yet high-performing classifier. The proposed solution involves fusing three
models, namely ResNet152V2, ConvNeXtBase, and ViT Base, to create an effective
teacher model. The teacher model is then employed to guide a lightweight
student model of size 2.03 MB. This student model is further compressed to
469.77 KB using 16-bit quantization, enabling smooth incorporation into edge
devices. With six-stage image preprocessing, data augmentation, and a rigorous
ablation study, the model achieves an impressive accuracy of 98.75% on the
HAM10000 dataset and 98.94% on the Kaggle dataset in classifying benign and
malignant skin cancers. With its high accuracy and compact size, our model
appears to be a potential choice for accurate skin cancer classification,
particularly in resource-constrained settings.
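
For readers unfamiliar with knowledge distillation, the sketch below shows the commonly used temperature-scaled distillation objective: a weighted sum of the KL divergence between softened teacher and student distributions and the usual cross-entropy on the true label. It is a generic formulation, not the loss reported in the paper; the temperature, the weight `alpha`, and the function names are assumptions, and how the three teacher models are fused is not covered here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic distillation loss for a single example.

    alpha weights the soft (teacher-matching) term; 1 - alpha weights
    the hard cross-entropy term on the ground-truth label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy of the unsoftened student prediction.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

The factor `temperature ** 2` keeps the gradient magnitude of the soft term comparable across temperatures, which is the usual motivation for including it.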
♻ ☆ FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts ACL 2024
Shubhankar Singh, Purvi Chaurasia, Yerram Varun, Pranshu Pandya, Vatsal Gupta, Vivek Gupta, Dan Roth
Existing benchmarks for visual question answering lack visual grounding
and complexity, particularly in evaluating spatial reasoning skills. We
introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of
visual question-answering multimodal language models in reasoning with
flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and
human-verified flowchart images from three distinct content sources, along with
22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks,
including information localization, decision-making, and logical progression.
We conduct a thorough baseline evaluation on a suite of both open-source and
proprietary multimodal language models using various strategies, followed by an
analysis of directional bias. The results underscore the benchmark's potential
as a vital tool for advancing the field of multimodal modeling, providing a
focused and challenging environment for enhancing model performance in visual
and logical reasoning tasks.
comment: Accepted in ACL 2024 (Findings), 21 pages, 7 figures, 9 Tables
♻ ☆ CSI4Free: GAN-Augmented mmWave CSI for Improved Pose Classification
In recent years, Joint Communication and Sensing (JC&S) has demonstrated
significant success, particularly in utilizing sub-6 GHz frequencies with
commercial-off-the-shelf (COTS) Wi-Fi devices for applications such as
localization, gesture recognition, and pose classification. Deep learning and
the existence of large public datasets have been pivotal in achieving such
results. However, at mmWave frequencies (30-300 GHz), which have shown potential
for more accurate sensing performance, there is a noticeable lack of research
in the domain of COTS Wi-Fi sensing. Challenges such as limited research
hardware, the absence of large datasets, limited functionality in COTS
hardware, and the complexities of data collection present obstacles to a
comprehensive exploration of this field. In this work, we aim to address these
challenges by developing a method that can generate synthetic mmWave channel
state information (CSI) samples. In particular, we use a generative adversarial
network (GAN) on an existing dataset, to generate 30,000 additional CSI
samples. The augmented samples exhibit a remarkable degree of consistency with
the original data, as indicated by the notably high GAN-train and GAN-test
scores. Furthermore, we integrate the augmented samples in training a pose
classification model. We observe that the augmented samples complement the real
data and improve the generalization of the classification model.
♻ ☆ All-In-One Medical Image Restoration via Task-Adaptive Routing MICCAI 2024
Although single-task medical image restoration (MedIR) has witnessed
remarkable success, the limited generalizability of these methods poses a
substantial obstacle to wider application. In this paper, we focus on the task
of all-in-one medical image restoration, aiming to address multiple distinct
MedIR tasks with a single universal model. Nonetheless, due to significant
differences between different MedIR tasks, training a universal model often
encounters task interference issues, where different tasks with shared
parameters may conflict with each other in the gradient update direction. This
task interference leads to deviation of the model update direction from the
optimal path, thereby affecting the model's performance. To tackle this issue,
we propose a task-adaptive routing strategy, allowing conflicting tasks to
select different network paths in spatial and channel dimensions, thereby
mitigating task interference. Experimental results demonstrate that our
proposed All-in-one Medical Image Restoration (AMIR) network achieves
state-of-the-art performance in three MedIR tasks: MRI super-resolution, CT
denoising, and PET synthesis, both in single-task and all-in-one settings. The
code and data will be available at
https://github.com/Yaziwel/All-In-One-Medical-Image-Restoration-via-Task-Adaptive-Routing.git.
comment: This article has been early accepted by MICCAI 2024
♻ ☆ Revisiting Backdoor Attacks against Large Vision-Language Models
Instruction tuning enhances large vision-language models (LVLMs) but raises
security risks through potential backdoor attacks due to their openness.
Previous backdoor studies focus on enclosed scenarios with consistent training
and testing instructions, neglecting the practical domain gaps that could
affect attack effectiveness. This paper empirically examines the
generalizability of backdoor attacks during the instruction tuning of LVLMs for
the first time, revealing certain limitations of most backdoor strategies in
practical scenarios. We quantitatively evaluate the generalizability of six
typical backdoor attacks on image caption benchmarks across multiple LVLMs,
considering both visual and textual domain offsets. Our findings indicate that
attack generalizability is positively correlated with the backdoor trigger's
irrelevance to specific images/models and the preferential correlation of the
trigger pattern. Additionally, we modify existing backdoor attacks based on the
above key observations, demonstrating significant improvements in cross-domain
scenario generalizability (+86% attack success rate). Notably, even without
access to the instruction datasets, a multimodal instruction set can be
successfully poisoned with a very low poisoning rate (0.2%), achieving an
attack success rate of over 97%. This paper underscores that even simple
traditional backdoor strategies pose a serious threat to LVLMs, necessitating
more attention and in-depth research.
comment: 23 pages, 8 figures
♻ ☆ Generative Autoencoding of Dropout Patterns
We propose a generative model termed Deciphering Autoencoders. In this model,
we assign a unique random dropout pattern to each data point in the training
dataset and then train an autoencoder to reconstruct the corresponding data
point using this pattern as information to be encoded. Even if a completely
random dropout pattern is assigned to each data point regardless of their
similarities, a sufficiently large encoder can smoothly map them to a
low-dimensional latent space to reconstruct individual training data points.
During inference, using a dropout pattern different from those used during
training allows the model to function as a generator. Since the training of
Deciphering Autoencoders relies solely on reconstruction error, it offers more
stable training compared to other generative models. Despite their simplicity,
Deciphering Autoencoders show sampling quality comparable to DCGAN on the
CIFAR-10 dataset.
♻ ☆ EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation CVPR 2024
Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao
In this report, we present our solutions to the EgoVis Challenges in CVPR
2024, including five tracks in the Ego4D challenge and three tracks in the
EPIC-Kitchens challenge. Building upon the video-language two-tower model and
leveraging our meticulously organized egocentric video data, we introduce a
novel foundation model called EgoVideo. This model is specifically designed to
cater to the unique characteristics of egocentric videos and provides strong
support for our competition submissions. In the Ego4D challenges, we tackle
various tasks including Natural Language Queries, Step Grounding, Moment
Queries, Short-term Object Interaction Anticipation, and Long-term Action
Anticipation. In addition, we also participate in the EPIC-Kitchens challenge,
where we engage in the Action Recognition, Multiple Instance Retrieval, and
Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these
diverse tasks, we showcase its versatility and effectiveness in different
egocentric video analysis scenarios, demonstrating the powerful representation
ability of EgoVideo as an egocentric foundation model. Our codebase and
pretrained models are publicly available at
https://github.com/OpenGVLab/EgoVideo.
comment: Champion solutions in the EgoVis CVPR 2024 workshop
♻ ☆ AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models
We consider the problem of customizing text-to-image diffusion models with
user-supplied reference images. Given new prompts, the existing methods can
capture the key concept from the reference images but fail to align the
generated image with the prompt. In this work, we seek to address this key
issue by proposing new methods that can easily be used in conjunction with
existing customization methods that optimize the embeddings/weights at various
intermediate stages of the text encoding process.
The first contribution of this paper is a dissection of the various stages of
the text encoding process leading up to the conditioning vector for
text-to-image models. We take a holistic view of existing customization methods
and notice that key and value outputs from this process differ substantially
from their corresponding baseline (non-customized) models (e.g., baseline
stable diffusion). While this difference does not impact the concept being
customized, it leads to other parts of the generated image not being aligned
with the prompt. Further, we also observe that these keys and values allow
independent control of various aspects of the final generation, enabling semantic
manipulation of the output. Taken together, the features spanning these keys
and values serve as the basis for our next contribution where we fix the
aforementioned issues with existing methods. We propose a new post-processing
algorithm, AlignIT, that infuses the keys and values for the concept of
interest while ensuring the keys and values for all other tokens in the input
prompt are unchanged.
Our proposed method can be plugged in directly to existing customization
methods, leading to a substantial performance improvement in the alignment of
the final result with the input prompt while retaining the customization
quality.
comment: 10 pages, 9 figures
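
The post-processing step described above, which keeps baseline keys and values for every token except the concept of interest, can be pictured with the toy sketch below. It is only an illustration of that selection rule under simplifying assumptions: each token's key/value pair is an opaque Python object, the concept token positions are known, and the function name `infuse_concept_kv` is hypothetical. It is not the AlignIT implementation.

```python
def infuse_concept_kv(baseline_kv, customized_kv, concept_positions):
    """Mix per-token key/value pairs from two models.

    baseline_kv and customized_kv are equal-length lists with one
    (key, value) entry per prompt token. Entries at concept_positions
    come from the customized model; all others stay at baseline.
    """
    if len(baseline_kv) != len(customized_kv):
        raise ValueError("both models must encode the same prompt tokens")
    positions = set(concept_positions)
    return [
        customized_kv[i] if i in positions else baseline_kv[i]
        for i in range(len(baseline_kv))
    ]
```

In a real diffusion pipeline the entries would be tensors produced by the cross-attention projections, but the selection logic would follow the same pattern.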
♻ ☆ Character-Adapter: Prompt-Guided Region Control for High-Fidelity Character Customization
Customized image generation, which seeks to synthesize images with consistent
characters, holds significant relevance for applications such as storytelling,
portrait generation, and character design. However, previous approaches have
encountered challenges in preserving characters with high-fidelity consistency
due to inadequate feature extraction and concept confusion of reference
characters. Therefore, we propose Character-Adapter, a plug-and-play framework
designed to generate images that preserve the details of reference characters,
ensuring high-fidelity consistency. Character-Adapter employs prompt-guided
segmentation to ensure fine-grained regional features of reference characters
and dynamic region-level adapters to mitigate concept confusion. Extensive
experiments are conducted to validate the effectiveness of Character-Adapter.
Both quantitative and qualitative results demonstrate that Character-Adapter
achieves the state-of-the-art performance of consistent character generation,
with an improvement of 24.8% compared with other methods. Our code will be
released at https://github.com/Character-Adapter/Character-Adapte
♻ ☆ MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension
Khiem Le, Zhichun Guo, Kaiwen Dong, Xiaobao Huang, Bozhao Nan, Roshni Iyer, Xiangliang Zhang, Olaf Wiest, Wei Wang, Nitesh V. Chawla
Recently, Large Language Models (LLMs) with their strong task-handling
capabilities have shown remarkable advancements across a spectrum of fields,
moving beyond natural language understanding. However, their proficiency within
the chemistry domain remains restricted, especially in solving professional
molecule-related tasks. This challenge is attributed to their inherent
limitations in comprehending molecules using only common textual
representations, i.e., SMILES strings. In this study, we seek to enhance the
ability of LLMs to comprehend molecules by designing and equipping them with a
multi-modal external module, namely MolX. In particular, instead of directly
using a SMILES string to represent a molecule, we utilize specific encoders to
extract fine-grained features from both SMILES string and 2D molecular graph
representations for feeding into an LLM. Moreover, a human-defined molecular
fingerprint is incorporated to leverage its embedded domain knowledge. Then, to
establish an alignment between MolX and the LLM's textual input space, the
whole model, in which the LLM is frozen, is pre-trained with a versatile
strategy including a diverse set of tasks. Extensive experimental evaluations
demonstrate that our proposed method only introduces a small number of
trainable parameters while outperforming baselines on various downstream
molecule-related tasks ranging from molecule-to-text translation to
retrosynthesis, with and without fine-tuning the LLM.
♻ ☆ AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
The field of text-to-image (T2I) generation has made significant progress in
recent years, largely driven by advancements in diffusion models. Linguistic
control enables effective content creation, but struggles with fine-grained
control over image generation. This challenge has been explored, to a great
extent, by incorporating additional user-supplied spatial conditions, such as
depth maps and edge maps, into pre-trained T2I models through extra encoding.
However, multi-control image synthesis still faces several challenges.
Specifically, current approaches are limited in handling free combinations of
diverse input control signals, overlook the complex relationships among
multiple spatial conditions, and often fail to maintain semantic alignment with
provided textual prompts. This can lead to suboptimal user experiences. To
address these challenges, we propose AnyControl, a multi-control image
synthesis framework that supports arbitrary combinations of diverse control
signals. AnyControl develops a novel Multi-Control Encoder that extracts a
unified multi-modal embedding to guide the generation process. This approach
enables a holistic understanding of user inputs, and produces high-quality,
faithful results under versatile control signals, as demonstrated by extensive
quantitative and qualitative evaluations. Our project page is available at
https://any-control.github.io.
♻ ☆ Manipulate-Anything: Automating Real-World Robots using Vision-Language Models
Large-scale endeavors like RT-1 and widespread community efforts such as
Open-X-Embodiment have contributed to growing the scale of robot demonstration
data. However, there is still an opportunity to improve the quality, quantity,
and diversity of robot demonstration data. Although vision-language models have
been shown to automatically generate demonstration data, their utility has been
limited to environments with privileged state information, they require
hand-designed skills, and are limited to interactions with few object
instances. We propose Manipulate-Anything, a scalable automated generation
method for real-world robotic manipulation. Unlike prior work, our method can
operate in real-world environments without any privileged state information
or hand-designed skills, and can manipulate any static object. We evaluate our
method using two setups. First, Manipulate-Anything successfully generates
trajectories for all 5 real-world and 12 simulation tasks, significantly
outperforming existing methods like VoxPoser. Second, Manipulate-Anything's
demonstrations can train more robust behavior cloning policies than training
with human demonstrations, or from data generated by VoxPoser and
Code-As-Policies. We believe Manipulate-Anything can be the scalable method for
both generating data for robotics and solving novel tasks in a zero-shot
setting.
comment: Project page: https://robot-ma.github.io/
♻ ☆ Epicardium Prompt-guided Real-time Cardiac Ultrasound Frame-to-volume Registration MICCAI 2024
Long Lei, Jun Zhou, Jialun Pei, Baoliang Zhao, Yueming Jin, Yuen-Chun Jeremy Teoh, Jing Qin, Pheng-Ann Heng
A comprehensive guidance view for cardiac interventional surgery can be
provided by the real-time fusion of the intraoperative 2D images and
preoperative 3D volume based on the ultrasound frame-to-volume registration.
However, cardiac ultrasound images are characterized by a low signal-to-noise
ratio and small differences between adjacent frames, coupled with significant
dimension variations between 2D frames and 3D volumes to be registered,
resulting in real-time and accurate cardiac ultrasound frame-to-volume
registration being a very challenging task. This paper introduces a lightweight
end-to-end Cardiac Ultrasound frame-to-volume Registration network, termed
CU-Reg. Specifically, the proposed model leverages epicardium prompt-guided
anatomical clues to reinforce the interaction of 2D sparse and 3D dense
features, followed by a voxel-wise local-global aggregation of enhanced
features, thereby boosting the cross-dimensional matching effectiveness of
low-quality ultrasound modalities. We further embed an inter-frame
discriminative regularization term within the hybrid supervised learning to
increase the distinction between adjacent slices in the same ultrasound volume
to ensure registration stability. Experimental results on the reprocessed CAMUS
dataset demonstrate that our CU-Reg surpasses existing methods in terms of
registration accuracy and efficiency, meeting the guidance requirements of
clinical cardiac interventional surgery.
comment: This paper has been accepted by MICCAI 2024
♻ ☆ Solving the Inverse Problem of Electrocardiography for Cardiac Digital Twins: A Survey
Cardiac digital twins are personalized virtual representations used to
understand complex heart mechanisms. Solving the ECG inverse problem is crucial
for accurate virtual heart modelling, enabling the derivation of internal
electrical activity information from recorded surface potentials. Despite
challenges from cardiac complexity, noisy ECG data, and computational
efficiency, recent advancements hold significant promise for enhancing virtual
heart modelling, ultimately advancing precision medicine in cardiology. This
paper aims to provide a comprehensive review of methods for solving the ECG
inverse problem, validation strategies, clinical applications, and
future perspectives. For the computing methodologies, we broadly classify
state-of-the-art approaches into two categories: deterministic and
probabilistic methods, including conventional and deep learning-based
techniques. Integrating physics laws with deep learning models holds promise,
but challenges such as capturing dynamic electrophysiology accurately,
accessing accurate domain knowledge, and quantifying prediction uncertainty
persist. Integrating models into clinical workflows while ensuring
interpretability and usability for healthcare professionals is essential.
Overcoming these challenges will drive further research in cardiac digital
twins.
♻ ☆ Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID CVPR 2024
Text-to-image person re-identification (ReID) retrieves pedestrian images
according to textual descriptions. Manually annotating textual descriptions is
time-consuming, restricting the scale of existing datasets and therefore the
generalization ability of ReID models. To address this, we study the transferable
text-to-image ReID problem, where we train a model on our proposed large-scale
database and directly deploy it to various datasets for evaluation. We obtain
substantial training data via Multi-modal Large Language Models (MLLMs).
Moreover, we identify and address two key challenges in utilizing the obtained
textual descriptions. First, an MLLM tends to generate descriptions with
similar structures, causing the model to overfit specific sentence patterns.
Thus, we propose a novel method that uses MLLMs to caption images according to
various templates. These templates are obtained using a multi-turn dialogue
with a Large Language Model (LLM). Therefore, we can build a large-scale
dataset with diverse textual descriptions. Second, an MLLM may produce
incorrect descriptions. Hence, we introduce a novel method that automatically
identifies words in a description that do not correspond with the image. This
method is based on the similarity between each text token and all patch token
embeddings in the image. Then, we mask these words with a larger probability in
the subsequent training epoch, alleviating the impact of noisy textual
descriptions. The experimental results demonstrate that our methods
significantly boost the direct transfer text-to-image ReID performance.
Benefiting from the pre-trained model weights, we also achieve state-of-the-art
performance in the traditional evaluation settings.
comment: CVPR 2024
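note: The noisy-word masking step described in the abstract above can be
illustrated with a minimal sketch. It is not the authors' implementation. The
abstract states only that the method relies on the similarity between a text
token and all patch token embeddings, and that mismatched words are masked with
a larger probability. Everything else here is an assumption made for
illustration: the max-over-patches cosine score, the threshold rule, the
probability values, the toy embeddings, and all function names.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_image_scores(word_embs, patch_embs):
    """Score each word by its best cosine match over all image patches.

    Taking the maximum is an assumed aggregation; the abstract does not
    say how the per-patch similarities are combined.
    """
    return [max(cosine(w, p) for p in patch_embs) for w in word_embs]


def mask_probabilities(scores, base_p=0.15, noisy_p=0.5, threshold=0.3):
    """Assumed rule: words scoring below `threshold` get the larger probability."""
    return [noisy_p if s < threshold else base_p for s in scores]


def apply_mask(words, probs, rng, mask_token="[MASK]"):
    """Replace each word by `mask_token` with its own masking probability."""
    return [mask_token if rng.random() < p else w for w, p in zip(words, probs)]


# Toy 2-D embeddings (hypothetical values, chosen so one word matches no patch).
words = ["man", "red", "umbrella"]
word_embs = [[1.0, 0.0], [0.8, 0.6], [0.0, -1.0]]
patch_embs = [[1.0, 0.0], [0.6, 0.8]]

scores = word_image_scores(word_embs, patch_embs)
probs = mask_probabilities(scores)
masked = apply_mask(words, probs, random.Random(0))
```

In this toy input, "umbrella" has no well-matching patch, so it receives the
larger masking probability for the next epoch, while the other two words keep
the base probability.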